Chunking Large Files with Pandas

A solution to this is to process the data source chunk by chunk instead of loading it all in a single go. Pandas is great for data manipulation, data analysis, and data visualization, and its DataFrame objects allow analysis at a fairly high performance rate compared with typical Python procedures. By default, however, read_csv reads the entire file into a single DataFrame; pass the chunksize or iterator parameter to have it return the data in chunks instead. Pandas can also process the file in chunks internally while parsing (the low_memory behaviour), which lowers memory use but may lead to mixed type inference.

Approaches such as reading only the first rows do not solve the underlying problem, because they do not give us a small randomised sample of the data while still covering the whole file. Reading in chunks does: each chunk is an ordinary DataFrame, so you can filter or summarise it and then combine the results. pandas.concat takes a list of Series or DataFrames and returns a single concatenated object; while the function is equivalent to SQL's UNION clause, there is a lot more that can be done with it.

A few parsing details are worth knowing up front. If the separator between each field of your data is not a comma, use the sep argument, and skipinitialspace=True removes padding after the delimiter, for example pd.read_csv('data.csv', sep=';', skipinitialspace=True); if the padding white spaces occur on both sides of the cell values, a regular-expression separator is needed instead. The simplest way to convert a pandas column of data to a different type is to use astype(). Finally, note that pandas can still run out of memory even after adding chunksize if every chunk is immediately concatenated back into one giant DataFrame, so keep only the rows or summaries you actually need from each chunk.
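As an illustration, here is a minimal sketch of chunked filtering; the file name large_data.csv and the column name cylinders are assumptions, standing in for the "cars with 6 cylinders" example used later in this article:

```python
import pandas as pd

# Read the CSV in chunks of 100,000 rows instead of all at once.
# Each chunk is a regular DataFrame, so it can be filtered immediately.
filtered_parts = []
for chunk in pd.read_csv("large_data.csv", chunksize=100_000):
    # Keep only the rows we care about before anything accumulates.
    filtered_parts.append(chunk[chunk["cylinders"] == 6])

# Combine the small filtered pieces into one DataFrame.
result = pd.concat(filtered_parts, ignore_index=True)
print(result.shape)
```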
Pandas is simple, expressive, and arguably one of the most important libraries in Python: not only does it make real-world data analysis significantly easier, it is also optimised to be significantly fast. As always, we need to import it first with import pandas as pd.

Your data will rarely arrive as one tidy file. It may be spread across a number of text files, spreadsheets, or databases, so you will want to import each piece as a collection of DataFrames and combine them to answer your central questions. The same chunked approach applies when a single file is simply too big: a read_csv call that works fine on an 8 GB file can crash the interpreter (and the notebook) on a 34 GB file, and the fix is to parse the CSV in chunks and append only the needed rows to the final DataFrame. In practice, it is often easiest simply to work in chunks all the time; when a loop processes the whole file chunk by chunk, the first chunk can be used to inspect column names and dtypes before committing to the rest, as shown in the sketch below. If you outgrow this pattern, it is worth looking at Dask, which builds a pandas-like, larger-than-memory DataFrame on top of exactly this kind of chunked, parallel processing.
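A sketch of that "look at the first chunk before processing the rest" idea; the file name cars.csv and the cylinders column are assumptions:

```python
import pandas as pd

reader = pd.read_csv("cars.csv", chunksize=500_000)

# Pull the first chunk to inspect the structure before doing real work.
first_chunk = next(reader)
print(first_chunk.dtypes)
print(first_chunk.head())

# Continue with the remaining chunks, e.g. counting 6-cylinder cars.
count = (first_chunk["cylinders"] == 6).sum()
for chunk in reader:
    count += (chunk["cylinders"] == 6).sum()
print(count)
```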
Two practical points make the chunked pattern work well. First, avoid appending rows to a DataFrame inside the loop: indexes are immutable, so each time you append, pandas has to create an entirely new one. Collect the per-chunk results in a list and call pd.concat once at the end. Second, the memory savings are real: since only one chunk is loaded at a time, peak memory usage in one example came down to 7K, compared with 28K when loading the full CSV, and pandas handles tables with up to 100 million rows quite comfortably when processed this way.

Chunking also interacts with type inference. With low_memory enabled, pandas reads a large block of rows (by default on the order of a million) before guessing each column's dtype, and different chunks can end up with different guesses; passing an explicit dtype mapping avoids the mixed-type problem entirely. The same chunked approach is handy for quick validations on a huge input file, say a ~12 GB CSV where you want row counts, distinct values per column, and column types without ever loading the whole thing.

The idea is not limited to comma-separated files. With pandas it is also possible to read fixed-width text files, skipping any preamble lines before the header, and pandas can likewise load Excel spreadsheets, SAS files (read_sas for .sas7bdat), and SQL query results into DataFrames; helper libraries such as simpledbf do the same for basic DBF files, converting them to CSV, DataFrames, SQL tables, or HDF5 tables. When a single machine is still not enough, Dask extends the same model, though operations like groupby, join, and set_index have special performance considerations there because of its parallel, larger-than-memory, and distributed nature.
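A small sketch of pinning the dtypes up front so chunks cannot disagree, used here for a simple row-count validation; the file and column names are assumptions:

```python
import pandas as pd

# Declaring dtypes up front avoids mixed-type inference across chunks
# and usually reduces memory as well.
dtypes = {"id": "int64", "zip_code": "string", "price": "float64"}

n_rows = 0
for chunk in pd.read_csv("listings.csv", dtype=dtypes, chunksize=250_000):
    n_rows += len(chunk)

print(n_rows)
```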
When you pass iterator=True (or chunksize) to read_csv, the returned TextFileReader also exposes a get_chunk() method, so you can pull a specific number of rows on demand rather than looping. By default each chunk keeps the running row index of the original file, so from the second chunk onwards the index continues where the previous chunk ended rather than restarting at zero. read_csv accepts a path (any os.PathLike object works) or a file-like object, meaning anything with a read() method such as an open file handle or a StringIO buffer; if you passed a file object and need to read it again, rewind it first with seek(0).

It is also worth understanding why big files hurt in the first place. Pandas handles data from roughly 100 MB to 1 GB quite efficiently, but pandas and numpy need to allocate contiguous memory blocks, and a 32-bit system caps a process at about 2 GB, so a CSV that loads fine on one machine can fail on another. Excel behaves a little differently: read_excel() reads the data into a DataFrame, where the first parameter is the filename and the second parameter is the sheet, and the whole sheet is parsed in one go.
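A minimal sketch of the iterator form; the file name is an assumption:

```python
import pandas as pd

# iterator=True returns a TextFileReader without reading anything yet.
reader = pd.read_csv("large_data.csv", iterator=True)

# Pull exactly 1,000 rows to look at, then 10,000 more for a quick check.
preview = reader.get_chunk(1000)
sample = reader.get_chunk(10_000)

print(preview.dtypes)
print(len(sample))
```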
A few engine and memory details are worth spelling out. The default C engine is more forgiving than it looks: thanks to the names parameter it will "fill in the blanks" for jagged rows where some lines have fewer fields than the header implies. The low_memory parameter of read_csv is set to True by default, which makes pandas parse the file in internal chunks and can produce mixed types within a column; to ensure no mixed types, either set it to False or, better, specify the types with the dtype parameter. And again, if the separator between each field of your data is not a comma, use the sep argument.

Why does any of this matter? Because memory is usually the binding constraint. Reducing memory usage becomes essential once a data set grows to tens of gigabytes (say 40-50 GB) and still has to fit into local memory for analysis, and at around 3 billion rows per file, even a straightforward pass over a single 120 GB file can take on the order of 8 hours with plain pandas. If the data is a CSV and you do not need to access all of it at once, for example when training a model batch by batch, reading it in chunks is the simplest answer; when you need genuine parallelism, Dask, a framework built on top of pandas with multi-processing and distributed processing in mind, and Modin are the usual next steps. There are even small subprocess-based tools for reading a Microsoft Access (.mdb) database into a pandas DataFrame, so the chunked-DataFrame workflow is not limited to text files.
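As a sketch of the "train in chunks" idea, assuming a scikit-learn style estimator with partial_fit and made-up file and column names:

```python
import pandas as pd
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
classes = [0, 1]  # all label values must be declared up front for partial_fit

for chunk in pd.read_csv("training_data.csv", chunksize=100_000, low_memory=False):
    X = chunk[["feature_a", "feature_b", "feature_c"]].to_numpy()
    y = chunk["label"].to_numpy()
    # Each chunk updates the model incrementally; nothing is kept in memory.
    model.partial_fit(X, y, classes=classes)
```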
The chunked pattern shows up in most of the everyday pandas recipes: reading and merging multiple CSV files with the same structure into one DataFrame, reading a specific Excel sheet, reading an Nginx access log with multiple quote characters, reading a CSV that has no header row, saving results back to CSV, loading a spreadsheet into a dict of DataFrames, or persisting intermediate results with HDFStore. For Excel specifically, pandas.ExcelFile lets you list the sheet names before deciding what to load, but read_excel has no cursor the way read_sql_query does, so a very large workbook cannot be streamed the way a CSV can. Fixed-width files, on the other hand, can be read in chunks just like CSVs, for example 10,000 lines at a time.

The same loop structure handles write-side problems too. A typical task is a file of around 7 GB that needs to be filtered and then saved into a MySQL database: read it chunk by chunk, filter each chunk, and write each filtered chunk out (to a file or a table) until the chunks are finished, so the full data set never sits in memory; in one such job a chunksize of 200,000 rows kept the memory footprint modest. Finally, remember that basic iteration behaves differently across pandas objects: iterating over a Series yields its values, while DataFrames follow the dict-like convention of iterating over the keys (the column labels), so iterate over chunks, not over a DataFrame, when you want the data itself.
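Here is a minimal sketch of that filter-and-write loop; the SQLAlchemy connection string, file name, table name, and column names are placeholders:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:password@localhost/mydb")

for chunk in pd.read_csv("events_7gb.csv", chunksize=200_000):
    # Keep only the rows we need from this chunk.
    wanted = chunk[chunk["event_type"] == "purchase"]
    # Append each filtered chunk to the target table; nothing accumulates in memory.
    wanted.to_sql("purchases", engine, if_exists="append", index=False)
```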
Chunking as a concept extends beyond file parsing. Splitting a plain Python string into fixed-length pieces is a one-liner with a list comprehension, for example chunks = [s[i:i+n] for i in range(0, len(s), n)] with n = 3 splits a string into chunks of length 3, and a list can be broken into chunks of size N the same way. On the analysis side, pandas' groupby enables the "split-apply-combine" paradigm: any groupby operation involves splitting the data into groups, applying a function to each group, and combining the results, which is exactly the shape of a per-chunk aggregation (aggregation being the process of turning the values of a dataset, or a subset of it, into one single value).

Storage formats have their own notion of chunks as well. HDF5 is a format designed to store large numerical arrays of homogeneous type, and with chunked storage the data is read and written in blocks; with a chunk shape of (100, 100), for instance, the data in dset[0:100, 0:100] is stored together in the file, as is dset[400:500, 100:200]. When each chunk is stored in a separate region or file, the layout is ideal for parallel access in both reading and writing, particularly if the Dask array chunks are aligned with the storage chunks.
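A sketch of split-apply-combine done per chunk and then combined again at the end; the file and column names are made up:

```python
import pandas as pd

partials = []
for chunk in pd.read_csv("sales.csv", chunksize=500_000):
    # "Split" and "apply" within the chunk: sum and count per store.
    g = chunk.groupby("store_id")["amount"].agg(["sum", "count"])
    partials.append(g)

# "Combine": sums and counts add across chunks, so the global mean is exact.
totals = pd.concat(partials).groupby(level=0).sum()
mean_per_store = totals["sum"] / totals["count"]
print(mean_per_store.head())
```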
Putting the pieces together, a common pipeline for a massive CSV looks like this: read the file in chunks, transform or filter each chunk, write the result to an on-disk store such as HDF5 (a few large CSVs can easily produce a 10 GB HDF5 file), and evict from RAM any pandas data that is no longer needed as you go. The same loop works whether the source is a CSV, a fixed-width file, or formats like Stata, SAS, and SPSS that pandas can also read. If you want to parallelise across many files, Dask makes it easy to read a whole directory of CSV files by running pandas.read_csv on them in parallel; it does not do quite everything pandas does, but it covers most of the commonly used functions. One caveat when benchmarking any of this: if each timing uses a different set of chunks, you are not exploiting chunk caching, so repeated runs measure genuine I/O rather than the operating system's cache.
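A minimal sketch of the CSV-to-HDF5 pipeline; the file names, store key, and the timestamp column are placeholders, and the append call requires the PyTables package:

```python
import pandas as pd

with pd.HDFStore("combined.h5", mode="w") as store:
    for chunk in pd.read_csv("big_input.csv", chunksize=500_000):
        # Light per-chunk cleanup before persisting.
        chunk = chunk.dropna(subset=["timestamp"])
        # append writes in table format, so chunks accumulate on disk, not in RAM.
        store.append("events", chunk, index=False)
```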
A few loose ends round out the picture. Older tutorials sometimes use DataFrame.from_csv; it is deprecated, and read_csv is recommended instead, not least because its parser engine (Wes McKinney's high-performance, memory-efficient design) is very fast, so chunked reading costs little on top. The object returned when you ask for chunks is a TextParser/TextFileReader, which you can drive in a for loop to read and process the data piece by piece; if you handed read_csv an open file object and need to restart, call seek(0) on that object (it is also available as the reader's f attribute). The same chunking idea applies to ordinary Python lists, which can be broken into n-sized chunks with a simple yield-based generator (see the sketch below), and to whole files, where helper modules can split a file of any size into multiple chunks with optimum use of memory and without compromising performance.

When you need test data, it is easy to generate a large CSV with pandas and numpy, say a million rows of random values, and practise the chunked workflow on it. When loading and appending multi-gigabyte CSVs, loading one data set fully and appending the second to it in chunks is quite feasible, and the per-state weather files from the earlier download example can be merged the same way. On the storage side, an HDF5 file can be repacked to remove compression or change its chunk dimensions, which can dramatically change read times. And if a single machine is still not enough, distributed frameworks take over; two good examples are Hadoop with the Mahout machine learning library and Spark with the MLlib library.
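For completeness, a sketch of the yield-based list chunker mentioned above:

```python
def chunked(items, n):
    """Yield successive n-sized chunks from a list."""
    for i in range(0, len(items), n):
        # yield lets the function resume here on the next call,
        # so only one chunk exists at a time on the consumer side.
        yield items[i:i + n]

for piece in chunked(list(range(10)), 3):
    print(piece)  # [0, 1, 2], [3, 4, 5], [6, 7, 8], [9]
```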
Excel deserves its own closing note: since read_excel cannot stream a workbook, the practical approach for a large .xls or .xlsx file is to read it once, immediately take the subset of columns and rows you actually need, and work with that smaller DataFrame from then on. And when the ultimate destination is a database, it is often faster to let the database ingest the file directly rather than routing everything through pandas: in one comparison, even the next slowest database (SQLite) loading the CSV itself was still 11x faster than reading the CSV into pandas and then sending that DataFrame to PostgreSQL with the to_sql method. Chunked pandas reading is the right tool when you need Python-side transformation along the way; when you only need the data in the database, load it there and query it afterwards.
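A small sketch of taking only a subset of a big workbook up front; the file, sheet, and column names are assumptions, and reading .xlsx requires the openpyxl package:

```python
import pandas as pd

# read_excel has no chunksize, but usecols and nrows limit what is parsed,
# and selecting rows immediately keeps the working DataFrame small.
df = pd.read_excel(
    "report_2020.xlsx",
    sheet_name="raw",
    usecols=["date", "region", "revenue"],
    nrows=200_000,
)
subset = df[df["region"] == "EU"]
print(subset.describe())
```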