Convert A Text Of Binary Values To Numpy File
Solution 1:
You could create many binary files with pickle, along with some code that loads and unloads the different parts of your data as needed.
Say you have a 16 GB file: you can split it into 16 pickle files of 1 GB each.
If, as you say, you have enough RAM, then once the pickle files are written you should be able to load them all into memory.
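A minimal sketch of this chunking idea (the file name, chunk size, and helper name below are illustrative, not from the question):

```python
import pickle

def split_into_pickles(path, lines_per_chunk=2):
    """Split a large text file into several smaller pickle files.

    Returns the list of pickle file names that were written.
    """
    written = []

    def dump(chunk, index):
        name = f"{path}.{index}.pkl"
        with open(name, "wb") as out:
            pickle.dump(chunk, out)
        written.append(name)

    chunk, index = [], 0
    with open(path) as f:
        for line in f:
            chunk.append(line)
            if len(chunk) >= lines_per_chunk:
                dump(chunk, index)
                chunk, index = [], index + 1
    if chunk:  # leftover lines that didn't fill a whole chunk
        dump(chunk, index)
    return written
```

Each chunk can then be reloaded independently with pickle.load, so only one part of the data needs to be in memory at a time.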
Solution 2:
As far as I can tell, your approach of reading the file is already quite memory efficient.
I assume that getting a file object with open does not read the whole file from the file system into RAM, but instead accesses the file as needed.
You then iterate over the file object, which yields the file's lines (strings in your case, as you've opened the file in text mode); i.e., the file object acts as a generator. Thus one can assume that no list of all lines is constructed here, and that the lines are read one by one and continuously consumed.
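A quick way to convince yourself of this lazy behaviour (using a small throwaway file, written here purely for illustration):

```python
# Write a tiny sample file, then read it back lazily.
with open("sample.txt", "w") as f:
    f.write("1,0,1\n0,0,1\n")

with open("sample.txt") as f:
    first = next(f)  # pulls exactly one line from the file object
    print(first)     # the second line has not been read yet
```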
You do this in a list comprehension. Do list comprehensions collect all values yielded by their right-hand side (the part after the in keyword) before passing them to their left-hand side (the part before the for keyword) for processing? A little experiment can tell us:
print('defining generator function')

def firstn(n):
    num = 0
    while num < n:
        print('yielding ' + str(num))
        yield num
        num += 1

print('--')

[print('consuming ' + str(i)) for i in firstn(5)]
The output of the above is
defining generator function
--
yielding 0
consuming 0
yielding 1
consuming 1
yielding 2
consuming 2
yielding 3
consuming 3
yielding 4
consuming 4
So the answer is no, each yielded value is immediately consumed by the left hand side before any other values are yielded from the right hand side. Only one line from the file should have to be kept in memory at a time.
So if the individual lines in your file aren't too long, your reading approach seems to be as memory efficient as it gets.
Of course, your list comprehension still has to collect the results of the left-hand side's processing; after all, the resulting list is what you want to get out of all this. So if you run out of memory, it is likely the resulting list that has become too large.
I don't know whether NumPy exploits the fact that collections of booleans can be stored more compactly than numbers. But if it does, you'd have to tell it that your integers are, in fact, boolean-valued to benefit from the more memory-efficient data type:
import numpy as np
f = open("data.txt", 'r')
converted_data = [np.fromstring(line, dtype=bool, sep=',') for line in f]
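For what it's worth, NumPy does store booleans more compactly: one byte per element, versus eight bytes per element for 64-bit integers:

```python
import numpy as np

ints = np.array([0, 1, 1, 0], dtype=np.int64)
bools = ints.astype(bool)

print(ints.nbytes)   # 32 -- eight bytes per element
print(bools.nbytes)  # 4  -- one byte per element
```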
If you don't need all of converted_data at once, but only need to iterate over it, consider making it a generator, too, instead of a list. You don't need to muck around with the yield keyword to achieve that: simply replace the list comprehension's square brackets with parentheses and you've got a generator expression:
converted_data_generator = ( np.fromstring(line, dtype=bool, sep=',') for line in f )
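A small end-to-end sketch of consuming such a generator (here a sample file is created first for illustration, and the conversion uses a plain split rather than np.fromstring, which behaves the same for this input):

```python
import numpy as np

# Create a sample file shaped like the one in the question.
with open("data.txt", "w") as f:
    f.write("1,0,1\n0,1,0\n")

with open("data.txt") as f:
    rows = (np.array([int(x) for x in line.split(',')], dtype=bool)
            for line in f)
    for row in rows:  # only one boolean array is in memory at a time
        print(row)
```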