Reading in Chunks
A final feature that can be a lifesaver is the ability to read in chunks. This lets you take a large dataset and read it in one piece at a time, which makes it far more manageable. Let's begin by creating an example file with 100 rows (in reality we would only do this with a large dataset, but it is enough to show the idea).
# Create some dummy data
import pandas as pd

test_data = pd.DataFrame([[x, x**2, x*5] for x in range(1, 101)])
test_data.to_csv("TestData.csv")
The simple objective is to get the sum of all values in the dataframe. In our example, you can imagine that instead of 100 values we have 100 million or more, so we may not be able to read the dataset into memory at all, depending on how much memory the computer has. An important question to ask yourself is whether what you are trying to achieve can actually be done in chunks. There are times when an operation can only run on the full dataset for one reason or another, and if that is the case your best bet is moving to a database.
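Sums are chunk-friendly because partial totals can be combined, and the same trick extends to other statistics that decompose into running totals. As a sketch (the file name `MeanDemo.csv` is just for this illustration), a mean can be computed from a running sum and a running count without ever holding the full column in memory:

```python
import pandas as pd

# Create a small illustrative file (stands in for a much larger one)
pd.DataFrame({"value": range(1, 101)}).to_csv("MeanDemo.csv", index=False)

running_sum = 0
running_count = 0
for chunk in pd.read_csv("MeanDemo.csv", chunksize=25):
    # Each chunk contributes only a partial sum and a row count
    running_sum += chunk["value"].sum()
    running_count += len(chunk)

mean = running_sum / running_count
print(mean)  # same result as computing the mean over the whole file at once
```

By contrast, something like a median needs all values at once, so it cannot be assembled from per-chunk results this way.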
If you pass the chunksize argument with the number of rows to read in each time, you will be able to break the reading into parts. The code below will chunk the dataframe into sets of 10 rows, and then we can loop through each chunk to get the sum.
chunks = pd.read_csv("TestData.csv", index_col=0, chunksize=10)
total = 0
for chunk in chunks:
    s = chunk.sum().sum()
    print("Sum of current chunk: {}".format(s))
    print()
    total += s
print("Total sum: {}".format(total))
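Because the object returned by read_csv with a chunksize is an iterator of dataframes, the running total can also be written more compactly by feeding the per-chunk sums straight into Python's built-in sum. A sketch (recreating the example file so the snippet runs on its own):

```python
import pandas as pd

# Recreate the example file so this snippet is self-contained
pd.DataFrame([[x, x**2, x*5] for x in range(1, 101)]).to_csv("TestData.csv")

# Each chunk is a DataFrame; chunk.sum().sum() collapses it to one number
total = sum(
    chunk.sum().sum()
    for chunk in pd.read_csv("TestData.csv", index_col=0, chunksize=10)
)
print("Total sum: {}".format(total))
```

Note that the iterator is exhausted after one pass, so you would need to call read_csv again to loop over the chunks a second time.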