Reading in Chunks
A final feature that can be a lifesaver is the ability to read in chunks. This lets you take a large dataset and read it in one piece at a time, which makes it far more manageable. Let's begin by creating an example file with 100 rows (in reality we would only do this with a large dataset, but it is enough to show the idea).
# Create some dummy data
import pandas as pd

test_data = pd.DataFrame([[x, x**2, x*5] for x in range(1, 101)])
test_data.to_csv("TestData.csv")
The simple objective is to get the sum of all values in the dataframe. In our example, you can imagine that instead of 100 values we have 100 million or more, so we may not be able to read the dataset into memory at all, depending on how much memory the computer has. An important question to ask yourself is whether what you are trying to achieve can actually be done in chunks. There are times when an operation can only run on the full dataset for one reason or another, and if that is the case your best bet is moving to a database.
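Sums are chunk-friendly because partial totals can be combined, and the same trick extends to other statistics that decompose into running totals. As a sketch (the file name `MeanDemo.csv` is just for this illustration), a mean can be computed from a running sum and a running count without ever holding the full column in memory:

```python
import pandas as pd

# Create a small illustrative file (stands in for a much larger one)
pd.DataFrame({"value": range(1, 101)}).to_csv("MeanDemo.csv", index=False)

running_sum = 0
running_count = 0
for chunk in pd.read_csv("MeanDemo.csv", chunksize=25):
    # Each chunk contributes only a partial sum and a row count
    running_sum += chunk["value"].sum()
    running_count += len(chunk)

mean = running_sum / running_count
print(mean)  # same result as computing the mean over the whole file at once
```

By contrast, something like a median needs all values at once, so it cannot be assembled from per-chunk results this way.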
If you pass the chunksize argument with the number of rows to read in each time, you will be able to break the reading into parts. The code below will chunk the dataframe into sets of 10 rows, and then we can loop through each chunk to get the sum.
chunks = pd.read_csv("TestData.csv", index_col=0, chunksize=10)
total = 0
for chunk in chunks:
    s = chunk.sum().sum()
    print("Sum of current chunk: {}".format(s))
    print()
    total += s
print("Total sum: {}".format(total))
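Because the object returned by read_csv with a chunksize is an iterator of dataframes, the running total can also be written more compactly by feeding the per-chunk sums straight into Python's built-in sum. A sketch (recreating the example file so the snippet runs on its own):

```python
import pandas as pd

# Recreate the example file so this snippet is self-contained
pd.DataFrame([[x, x**2, x*5] for x in range(1, 101)]).to_csv("TestData.csv")

# Each chunk is a DataFrame; chunk.sum().sum() collapses it to one number
total = sum(
    chunk.sum().sum()
    for chunk in pd.read_csv("TestData.csv", index_col=0, chunksize=10)
)
print("Total sum: {}".format(total))
```

Note that the iterator is exhausted after one pass, so you would need to call read_csv again to loop over the chunks a second time.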