Read multiple large CSV files into an array in parallel


With the help of some tutorials and some people here on Stack Overflow, I’ve managed to stitch together this basic parallel (multiple files) CSV -> array reader. Can I make it any faster? I’ve read here and there about preloading files into memory, tuning the threads somehow, or maybe doing some parts in CUDA (with which I have a little experience), but I have no clue what the next step should be. Any suggestions to make this faster?

// parallel-matrix-multiply.cpp
// compile with: /EHsc
#include <windows.h>
#include <ppl.h>
#include <cstdio>
#include <ctime>
#include <fstream>
#include <iostream>
#include <random>
#include <sstream>
#include <string>

using namespace concurrency;
using namespace std;

int main()
{
    int numRows = 360;
    int numCols = 4096;

    // One row pointer per row, for all 120 files
    int** data = new int*[numRows * 120];
    for (int i = 0; i < numRows * 120; i++) {
        data[i] = new int[numCols];
    }

    clock_t starttimetotal = clock();
    char comma; // Just a place holder to store the commas
    char newLine; // Just a place holder to store the newlines

    int m = 120; //120 files of same format

    // Each iteration reads one file into its own block of numRows rows
    Concurrency::parallel_for(0, m,
        [&numCols, &numRows, &comma, &newLine, &data](int i) {

        std::ifstream in("C:/codeoutput/output_" + std::to_string(i + 1) + ".txt");

        for (int row = 0; row < numRows; row++) {
            for (int col = 0; col < numCols; col++) {
                // Grab Data for the cell in (row,col)
                in >> data[i * numRows + row][col];
                // If this is not the last column grab the comma between the values
                if (col < numCols - 1) {
                    in >> comma;
                }
            }
            in >> newLine; // Grab the remaining newLine character
        }
    });

    clock_t stoptotal = clock();
    double elapsed = (double)(stoptotal - starttimetotal) * 1000.0 / CLOCKS_PER_SEC;
    printf("Time elapsed in ms: %f\n", elapsed);

    return 0;
}


Possibly better suited for Code Review. A simple thing you could do is switch from text files to binary files. That saves the time spent parsing text and also eliminates the if statement inside your inner loop, both of which should help performance.
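To make the binary-file suggestion concrete, here is a minimal sketch in portable C++. The helper names `write_binary` and `read_binary` are made up for illustration, and the sketch assumes the files are written and read on the same machine (same `int` size and byte order), for example by a one-off conversion step that turns each text file into a binary one.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: dump the values as raw ints.
// The layout depends on the platform's int size and byte order.
bool write_binary(const std::string& path, const std::vector<int>& values) {
    std::ofstream out(path, std::ios::binary);
    if (!out) return false;
    out.write(reinterpret_cast<const char*>(values.data()),
              static_cast<std::streamsize>(values.size() * sizeof(int)));
    out.close();
    return !out.fail();
}

// Hypothetical helper: read exactly `count` ints with a single read() call.
// Returns an empty vector if the file is missing or too short.
std::vector<int> read_binary(const std::string& path, std::size_t count) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return {};
    std::vector<int> values(count);
    const std::streamsize bytes = static_cast<std::streamsize>(count * sizeof(int));
    in.read(reinterpret_cast<char*>(values.data()), bytes);
    if (in.gcount() != bytes) return {};
    return values;
}
```

With the sizes from the question, one file would be read as `read_binary(path, 360 * 4096)`; there is no per-cell parsing and no branch in the inner loop, because there is no inner loop at all.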
You could also switch from an int** (2D, A elements in total, one allocation per row) to a single int* (1D, length A) and do the row/column -> index calculation by hand. This removes one pointer lookup per access and keeps the data contiguous, which might let the CPU keep more of it in cache.
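A small sketch of the flat layout, assuming row-major order. `FlatMatrix` is a hypothetical name, not something from the question, and `std::vector` stands in for the raw `new[]` allocation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: a rows x cols matrix stored in one contiguous buffer.
struct FlatMatrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<int> cells;  // row-major: element (r, c) lives at r * cols + c

    FlatMatrix(std::size_t r, std::size_t c) : rows(r), cols(c), cells(r * c) {}

    int& at(std::size_t r, std::size_t c) { return cells[r * cols + c]; }
    const int& at(std::size_t r, std::size_t c) const { return cells[r * cols + c]; }

    // Pointer to the first element of row r, e.g. the start of one file's block.
    int* row(std::size_t r) { return cells.data() + r * cols; }
};
```

For the question's sizes this would be `FlatMatrix data(360 * 120, 4096);`, and file `i` would fill the rows starting at `data.row(i * 360)`. A single allocation also replaces the 43,200 separate `new int[numCols]` calls.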
A couple of things: you have data races on comma and newLine, because every iteration of the parallel loop writes to the same two variables; declaring them inside the lambda avoids that. You could also move the allocation loop into the parallel part. You could also read each whole file into memory using the low-level, platform-dependent I/O functions, and then walk through that buffer to extract the data you need.
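The last two points can be sketched together. This version is an illustration under some assumptions: it uses portable `std::ifstream` in place of the platform-dependent I/O calls mentioned above, it parses with `std::strtol`, and the names `slurp`, `parse_ints`, and `load_file` are made up. Every variable is local to the call, so nothing is shared between parallel iterations.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <fstream>
#include <string>

// Hypothetical helper: read the whole file into one string with a single read().
std::string slurp(const std::string& path) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);  // open at the end
    if (!in) return {};
    const std::streamoff size = in.tellg();
    if (size <= 0) return {};
    std::string text(static_cast<std::size_t>(size), '\0');
    in.seekg(0);
    in.read(&text[0], static_cast<std::streamsize>(text.size()));
    return text;
}

// Hypothetical helper: parse up to `capacity` integers separated by commas
// and/or whitespace (including newlines). Returns how many were stored.
std::size_t parse_ints(const std::string& text, int* out, std::size_t capacity) {
    const char* p = text.c_str();
    std::size_t n = 0;
    while (n < capacity) {
        char* end = nullptr;
        const long value = std::strtol(p, &end, 10);
        if (end == p) break;  // no digits found: end of data or malformed input
        out[n++] = static_cast<int>(value);
        p = end;
        // Skip the separators that follow the number.
        while (*p == ',' || std::isspace(static_cast<unsigned char>(*p))) ++p;
    }
    return n;
}

// Load one file into a caller-provided block; uses only local state.
std::size_t load_file(const std::string& path, int* out, std::size_t capacity) {
    const std::string text = slurp(path);
    return parse_ints(text, out, capacity);
}
```

Inside the `parallel_for` lambda, each iteration would then call `load_file` with the path of file `i` and a pointer to that file's own block of `numRows * numCols` ints, so the iterations never write to the same memory. As a side note, formatted `>>` extraction skips leading whitespace by default, so the stream version does not need a separate placeholder read for the newline either.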