Convert A Text Of Binary Values To Numpy File
Solution 1:
You could create many binary files with pickle, along with some code that loads and unloads the different parts of your data as needed.
Say you have a 16 GB file: you can split it into 16 pickle files of 1 GB each.
If, as you say, you have enough RAM, then once the pickle files are written you should be able to load them all into memory.
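A minimal sketch of this chunking idea (the file name, chunk size, and helper name below are illustrative, not from the question):

```python
import pickle

def split_into_pickles(path, lines_per_chunk=2):
    """Split a large text file into several smaller pickle files.

    Returns the list of pickle file names that were written.
    """
    written = []

    def dump(chunk, index):
        name = f"{path}.{index}.pkl"
        with open(name, "wb") as out:
            pickle.dump(chunk, out)
        written.append(name)

    chunk, index = [], 0
    with open(path) as f:
        for line in f:
            chunk.append(line)
            if len(chunk) >= lines_per_chunk:
                dump(chunk, index)
                chunk, index = [], index + 1
    if chunk:  # leftover lines that didn't fill a whole chunk
        dump(chunk, index)
    return written
```

Each chunk can then be reloaded independently with pickle.load, so only one part of the data needs to be in memory at a time.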
Solution 2:
As far as I can tell, your approach of reading the file is already quite memory efficient.
I assume that getting a file object with open does not read the whole file from the file system into RAM, but instead accesses the file as needed.
You then iterate over the file object, which yields the file's lines (strings in your case, as you've opened the file in text mode); i.e., the file object acts as a generator. Thus one can assume that no list of all lines is constructed here, and that the lines are read one by one and continuously consumed.
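A quick way to convince yourself of this lazy behaviour (using a small throwaway file, written here purely for illustration):

```python
# Write a tiny sample file, then read it back lazily.
with open("sample.txt", "w") as f:
    f.write("1,0,1\n0,0,1\n")

with open("sample.txt") as f:
    first = next(f)  # pulls exactly one line from the file object
    print(first)     # the second line has not been read yet
```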
You do this in a list comprehension. Do list comprehensions collect all values yielded by their right-hand side (the part after the in keyword) before passing them to their left-hand side (the part before the for keyword) for processing? A little experiment can tell us:
print('defining generator function')

def firstn(n):
    num = 0
    while num < n:
        print('yielding ' + str(num))
        yield num
        num += 1

print('--')

[print('consuming ' + str(i)) for i in firstn(5)]
The output of the above is
defining generator function
--
yielding 0
consuming 0
yielding 1
consuming 1
yielding 2
consuming 2
yielding 3
consuming 3
yielding 4
consuming 4
So the answer is no, each yielded value is immediately consumed by the left hand side before any other values are yielded from the right hand side. Only one line from the file should have to be kept in memory at a time.
So if the individual lines in your file aren't too long, your reading approach seems to be as memory efficient as it gets.
Of course, your list comprehension still has to collect the results of the left-hand side's processing; after all, the resulting list is what you want to get out of all this. So if you run out of memory, it is likely the resulting list that has become too large.
I don't know whether NumPy exploits the fact that collections of booleans can be stored more compactly than numbers. But if it does, you'd have to tell it that your integers are, in fact, boolean-valued to benefit from the more memory-efficient data type:
import numpy as np
f = open("data.txt", 'r')
converted_data = [np.fromstring(line, dtype=bool, sep=',') for line in f]
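For what it's worth, NumPy does store booleans more compactly: one byte per element, versus eight bytes per element for 64-bit integers:

```python
import numpy as np

ints = np.array([0, 1, 1, 0], dtype=np.int64)
bools = ints.astype(bool)

print(ints.nbytes)   # 32 -- eight bytes per element
print(bools.nbytes)  # 4  -- one byte per element
```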
If you don't need all of converted_data at once, but only need to iterate over it, consider making it a generator, too, instead of a list. You don't need to muck around with the yield keyword to achieve that: simply replace the list comprehension's square brackets with parentheses and you've got a generator expression:
converted_data_generator = ( np.fromstring(line, dtype=bool, sep=',') for line in f )
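A small end-to-end sketch of consuming such a generator (here a sample file is created first for illustration, and the conversion uses a plain split rather than np.fromstring, which behaves the same for this input):

```python
import numpy as np

# Create a sample file shaped like the one in the question.
with open("data.txt", "w") as f:
    f.write("1,0,1\n0,1,0\n")

with open("data.txt") as f:
    rows = (np.array([int(x) for x in line.split(',')], dtype=bool)
            for line in f)
    for row in rows:  # only one boolean array is in memory at a time
        print(row)
```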