Reading in Chunks
A final feature that can be a lifesaver is the ability to read a file in chunks. This lets you take a large dataset and read in only a piece at a time, which keeps it manageable. Let's begin by creating an example file with 100 rows (in practice we would only do this with a large dataset, but a small one is enough to show the idea).
import pandas as pd

# Create some dummy data: 100 rows holding x, x squared, and 5x
test_data = pd.DataFrame([[x, x**2, x*5] for x in range(1, 101)])
test_data.to_csv("TestData.csv")
The objective is simply to get the sum of all values in the dataframe. You can imagine that instead of 100 rows we have 100 million or more, so we may not be able to read the whole dataset at once, depending on how much memory the computer has. An important question to ask yourself is whether what you are trying to achieve can be done in chunks. Some operations can only be run on the full dataset, and in that case your best bet is to move to a database.
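One rough way to judge whether a file will fit in memory is to read a small sample and extrapolate. The sketch below is an illustration under stated assumptions, not part of the walkthrough itself: it reads the first rows using the nrows argument of read_csv, measures them with memory_usage, and scales the result by total_rows, a hypothetical row count for the full dataset. The CSV is built in memory with io.StringIO only so that the example runs on its own.

```python
import io
import pandas as pd

# Build a small CSV in memory so the sketch is self-contained
# (it stands in for a large file on disk).
csv_text = pd.DataFrame([[x, x**2, x*5] for x in range(1, 101)]).to_csv()

# Read only the first 10 rows as a sample.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0, nrows=10)

# Bytes used by the sample, including its index.
sample_bytes = sample.memory_usage(deep=True).sum()
bytes_per_row = sample_bytes / len(sample)

# total_rows is a hypothetical figure for the full dataset.
total_rows = 100_000_000
estimated_gb = bytes_per_row * total_rows / 1e9
print("Estimated size in memory: {:.1f} GB".format(estimated_gb))
```

The estimate is only as good as the sample: columns with long strings or mixed types can make later rows much larger than the first few.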
If you pass the chunksize argument with the number of rows to read each time, read_csv returns an iterator that yields the file one piece at a time instead of a single dataframe. The code below reads the file in sets of 10 rows, and we loop through each chunk to accumulate the sum.
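# With chunksize set, read_csv yields dataframes of 10 rows each
chunks = pd.read_csv("TestData.csv", index_col=0, chunksize=10)
total = 0
for chunk in chunks:
    # Sum down each column, then across the column totals
    s = chunk.sum().sum()
    print("Sum of current chunk: {}".format(s))
    print()
    total += s
print("Total sum: {}".format(total))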
Sum of current chunk: 715
Sum of current chunk: 3415
Sum of current chunk: 8115
Sum of current chunk: 14815
Sum of current chunk: 23515
Sum of current chunk: 34215
Sum of current chunk: 46915
Sum of current chunk: 61615
Sum of current chunk: 78315
Sum of current chunk: 97015
Total sum: 368650
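A sum combines cleanly across chunks, which is why the loop above works. Other statistics need a little more bookkeeping. The sketch below, which is an added illustration rather than part of the original example, computes the column means by tracking a running sum and a running row count; the names running_sum and running_count are made up for this example, and the CSV is held in memory with io.StringIO so the code runs on its own.

```python
import io
import pandas as pd

# Same dummy data as above, kept in memory so the sketch is self-contained.
csv_text = pd.DataFrame([[x, x**2, x*5] for x in range(1, 101)]).to_csv()

chunks = pd.read_csv(io.StringIO(csv_text), index_col=0, chunksize=10)

running_sum = None   # per-column totals so far
running_count = 0    # number of rows seen so far
for chunk in chunks:
    col_sums = chunk.sum()
    running_sum = col_sums if running_sum is None else running_sum + col_sums
    running_count += len(chunk)

# Column means computed without holding the full dataset in memory.
chunked_mean = running_sum / running_count
print(chunked_mean)
```

Averaging the per-chunk means would only give the right answer when every chunk has the same number of rows, so carrying the sum and count separately is the safer choice. A statistic like the median cannot be pieced together this way in general, which is the kind of case where a database becomes the better tool.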