Lazy Method for Reading Big File in Python?

The Question :

308 people think this question is useful

I have a very big file 4GB and when I try to read it my computer hangs. So I want to read it piece by piece and after processing each piece store the processed piece into another file and read next piece.

Is there any method to yield these pieces ?

I would love to have a lazy method.

The Question Comments :

The Answer 1

455 people think this answer is useful

To write a lazy function, just use yield:

def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data


with open('really_big_file.dat') as f:
    for piece in read_in_chunks(f):
        process_data(piece)


Another option would be to use iter and a helper function:

f = open('really_big_file.dat')
def read1k():
    return f.read(1024)

for piece in iter(read1k, ''):
    process_data(piece)


If the file is line-based, the file object is already a lazy generator of lines:

for line in open('really_big_file.dat'):
    process_data(line)

The Answer 2

41 people think this answer is useful

If your computer, OS and python are 64-bit, then you can use the mmap module to map the contents of the file into memory and access it with indices and slices. Here an example from the documentation:

import mmap
with open("hello.txt", "r+") as f:
    # memory-map the file, size 0 means whole file
    map = mmap.mmap(f.fileno(), 0)
    # read content via standard file methods
    print map.readline()  # prints "Hello Python!"
    # read content via slice notation
    print map[:5]  # prints "Hello"
    # update content using slice notation;
    # note that new content must have same size
    map[6:] = " world!\n"
    # ... and read again using standard file methods
    map.seek(0)
    print map.readline()  # prints "Hello  world!"
    # close the map
    map.close()

If either your computer, OS or python are 32-bit, then mmap-ing large files can reserve large parts of your address space and starve your program of memory.

The Answer 3

39 people think this answer is useful

file.readlines() takes in an optional size argument which approximates the number of lines read in the lines returned.

bigfile = open('bigfilename','r')
tmp_lines = bigfile.readlines(BUF_SIZE)
while tmp_lines:
    process([line for line in tmp_lines])
    tmp_lines = bigfile.readlines(BUF_SIZE)

The Answer 4

32 people think this answer is useful

There are already many good answers, but if your entire file is on a single line and you still want to process “rows” (as opposed to fixed-size blocks), these answers will not help you.

99% of the time, it is possible to process files line by line. Then, as suggested in this answer, you can to use the file object itself as lazy generator:

with open('big.csv') as f:
    for line in f:
        process(line)

However, I once ran into a very very big (almost) single line file, where the row separator was in fact not '\n' but '|'.

  • Reading line by line was not an option, but I still needed to process it row by row.
  • Converting'|' to '\n' before processing was also out of the question, because some of the fields of this csv contained '\n' (free text user input).
  • Using the csv library was also ruled out because the fact that, at least in early versions of the lib, it is hardcoded to read the input line by line.

For these kind of situations, I created the following snippet:

def rows(f, chunksize=1024, sep='|'):
    """
    Read a file where the row separator is '|' lazily.

    Usage:

    >>> with open('big.csv') as f:
    >>>     for r in rows(f):
    >>>         process(row)
    """
    curr_row = ''
    while True:
        chunk = f.read(chunksize)
        if chunk == '': # End of file
            yield curr_row
            break
        while True:
            i = chunk.find(sep)
            if i == -1:
                break
            yield curr_row + chunk[:i]
            curr_row = ''
            chunk = chunk[i+1:]
        curr_row += chunk

I was able to use it successfully to solve my problem. It has been extensively tested, with various chunk sizes.


Test suite, for those who want to convince themselves.

test_file = 'test_file'

def cleanup(func):
    def wrapper(*args, **kwargs):
        func(*args, **kwargs)
        os.unlink(test_file)
    return wrapper

@cleanup
def test_empty(chunksize=1024):
    with open(test_file, 'w') as f:
        f.write('')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 1

@cleanup
def test_1_char_2_rows(chunksize=1024):
    with open(test_file, 'w') as f:
        f.write('|')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 2

@cleanup
def test_1_char(chunksize=1024):
    with open(test_file, 'w') as f:
        f.write('a')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 1

@cleanup
def test_1025_chars_1_row(chunksize=1024):
    with open(test_file, 'w') as f:
        for i in range(1025):
            f.write('a')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 1

@cleanup
def test_1024_chars_2_rows(chunksize=1024):
    with open(test_file, 'w') as f:
        for i in range(1023):
            f.write('a')
        f.write('|')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 2

@cleanup
def test_1025_chars_1026_rows(chunksize=1024):
    with open(test_file, 'w') as f:
        for i in range(1025):
            f.write('|')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 1026

@cleanup
def test_2048_chars_2_rows(chunksize=1024):
    with open(test_file, 'w') as f:
        for i in range(1022):
            f.write('a')
        f.write('|')
        f.write('a')
        # -- end of 1st chunk --
        for i in range(1024):
            f.write('a')
        # -- end of 2nd chunk
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 2

@cleanup
def test_2049_chars_2_rows(chunksize=1024):
    with open(test_file, 'w') as f:
        for i in range(1022):
            f.write('a')
        f.write('|')
        f.write('a')
        # -- end of 1st chunk --
        for i in range(1024):
            f.write('a')
        # -- end of 2nd chunk
        f.write('a')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 2

if __name__ == '__main__':
    for chunksize in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]:
        test_empty(chunksize)
        test_1_char_2_rows(chunksize)
        test_1_char(chunksize)
        test_1025_chars_1_row(chunksize)
        test_1024_chars_2_rows(chunksize)
        test_1025_chars_1026_rows(chunksize)
        test_2048_chars_2_rows(chunksize)
        test_2049_chars_2_rows(chunksize)

The Answer 5

12 people think this answer is useful
f = ... # file-like object, i.e. supporting read(size) function and 
        # returning empty string '' when there is nothing to read

def chunked(file, chunk_size):
    return iter(lambda: file.read(chunk_size), '')

for data in chunked(f, 65536):
    # process the data

UPDATE: The approach is best explained in https://stackoverflow.com/a/4566523/38592

The Answer 6

7 people think this answer is useful

Refer to python’s official documentation https://docs.python.org/3/library/functions.html#iter

Maybe this method is more pythonic:

from functools import partial

"""A file object returned by open() is a iterator with
read method which could specify current read's block size"""
with open('mydata.db', 'r') as f_in:

    part_read = partial(f_in.read, 1024*1024)
    iterator = iter(part_read, b'')

    for index, block in enumerate(iterator, start=1):
        block = process_block(block)    # process your block data
        
        with open(f'{index}.txt', 'w') as f_out:
            f_out.write(block)

The Answer 7

3 people think this answer is useful

I think we can write like this:

def read_file(path, block_size=1024): 
    with open(path, 'rb') as f: 
        while True: 
            piece = f.read(block_size) 
            if piece: 
                yield piece 
            else: 
                return

for piece in read_file(path):
    process_piece(piece)

The Answer 8

3 people think this answer is useful

In Python 3.8+ you can use .read() in a while loop:

with open("somefile.txt") as f:
    while chunk := f.read(8192):
        do_something(chunk)

Of course, you can use any chunk size you want, you don’t have to use 8192 (2**13) bytes. Unless your file’s size happens to be a multiple of your chunk size, the last chunk will be smaller than your chunk size.

The Answer 9

2 people think this answer is useful

i am not allowed to comment due to my low reputation, but SilentGhosts solution should be much easier with file.readlines([sizehint])

python file methods

edit: SilentGhost is right, but this should be better than:

s = "" 
for i in xrange(100): 
   s += file.next()

The Answer 10

1 people think this answer is useful

I’m in a somewhat similar situation. It’s not clear whether you know chunk size in bytes; I usually don’t, but the number of records (lines) that is required is known:

def get_line():
     with open('4gb_file') as file:
         for i in file:
             yield i

lines_required = 100
gen = get_line()
chunk = [i for i, j in zip(gen, range(lines_required))]

Update: Thanks nosklo. Here’s what I meant. It almost works, except that it loses a line ‘between’ chunks.

chunk = [next(gen) for i in range(lines_required)]

Does the trick w/o losing any lines, but it doesn’t look very nice.

The Answer 11

0 people think this answer is useful

To process line by line, this is an elegant solution:

  def stream_lines(file_name):
    file = open(file_name)
    while True:
      line = file.readline()
      if not line:
        file.close()
        break
      yield line

As long as there’re no blank lines.

The Answer 12

-2 people think this answer is useful

you can use following code.

file_obj = open('big_file') 

open() returns a file object

then use os.stat for getting size

file_size = os.stat('big_file').st_size

for i in range( file_size/1024):
    print file_obj.read(1024)

Add a Comment