This tutorial explains various methods to read data in Python. Data can be in any of the popular formats - CSV, TXT, XLS/XLSX (Excel), sas7bdat (SAS), Stata, Rdata (R) etc. Loading data into the Python environment is the first step in analyzing data.
While importing external files, we need to check the following points (a short code sketch covering these checks follows the list) -
- Check whether a header row exists or not
- Treatment of special values as missing values
- Consistent data type within a variable (column)
- Date variables in a consistent date format
- No truncation of rows while reading external data
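Many of these checks map directly to read_csv() arguments. Below is a minimal sketch; the file name and column names are hypothetical and only illustrate where each check fits.
import pandas as pd
mydata = pd.read_csv("file1.csv",                 # hypothetical file
                     header=0,                    # header row exists (use header=None if it does not)
                     na_values=[".", "NA"],       # treat these special values as missing
                     dtype={"ID": "object"},      # force a consistent data type for a column
                     parse_dates=["join_date"])   # parse a date column into a consistent format
print(len(mydata))                                # confirm no rows were truncated while reading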
Install and Load pandas Package
pandas is a powerful data analysis package. It makes data exploration and manipulation easy. It has several functions to read data from various sources. If you are using Anaconda, pandas must already be installed. You need to load the package using the following command -
import pandas as pd
If the pandas package is not installed, you can install it by running the following command in the IPython console. If you are using Spyder, you can submit it in the IPython console within Spyder.
!pip install pandas
If you are using Anaconda, you can try the following line of code to install pandas -
!conda install pandas
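Once installed, a quick way to confirm pandas is available (an optional sanity check):
import pandas as pd
print(pd.__version__)   # prints the installed pandas version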
1. Import CSV files
It is important to note that a single backslash does not work when specifying the file path. You need to either change it to a forward slash or add one more backslash, like below.
import pandas as pd
mydata = pd.read_csv("C:\\Users\\Deepanshu\\Documents\\file1.csv")
If there is no header (title) in the raw data file
mydata1 = pd.read_csv("C:\\Users\\Deepanshu\\Documents\\file1.csv", header = None)
You need to include the header = None option to tell Python there are no column names (header) in the data.
Add Column Names
We can include column names by using the names= option.
mydata2 = pd.read_csv("C:\\Users\\Deepanshu\\Documents\\file1.csv", header = None, names = ['ID', 'first_name', 'salary'])
The variable names can also be added separately by using the following command.
mydata1.columns = ['ID', 'first_name', 'salary']
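Note that if the raw file already contains a header row and you simply want to replace those names while reading, you can combine header=0 with names= (a sketch using the same hypothetical file path):
mydata2b = pd.read_csv("C:\\Users\\Deepanshu\\Documents\\file1.csv", header=0, names=['ID', 'first_name', 'salary'])   # header=0 drops the file's own header row and applies the new names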
2. Import File from URL
You don't need to perform additional steps to fetch data from a URL. Simply put the URL in the read_csv() function (this applies only to CSV files stored at a URL).
mydata = pd.read_csv("http://winterolympicsmedals.com/medals.csv")
3. Read Text File
We can use the read_table() function to pull data from a text file. We can also use read_csv() with sep="\t" to read data from a tab-separated file.
mydata = pd.read_table("C:\\Users\\Deepanshu\\Desktop\\example2.txt")
mydata = pd.read_csv("C:\\Users\\Deepanshu\\Desktop\\example2.txt", sep ="\t")
4. Read Excel File
The read_excel() function can be used to import Excel data into Python.
mydata = pd.read_excel("https://www.eia.gov/dnav/pet/hist_xls/RBRTEd.xls", sheetname="Data 1", skiprows=2)
If you do not specify the name of the sheet in the sheetname= option, it reads the first sheet by default.
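Note that in newer pandas versions (0.21 and later) the argument was renamed to sheet_name; sheetname no longer works there. A sketch with the same workbook, assuming a recent pandas version (reading an .xls file may also require the xlrd package to be installed):
mydata = pd.read_excel("https://www.eia.gov/dnav/pet/hist_xls/RBRTEd.xls", sheet_name="Data 1", skiprows=2)   # sheet_name replaces sheetname in recent pandas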
5. Read delimited file
Suppose you need to import a file that is separated by white spaces.
mydata2 = pd.read_table("http://www.ssc.wisc.edu/~bhansen/econometrics/invest.dat", sep="\s+", header = None)
To include variable names, use the names= option like below -
mydata3 = pd.read_table("http://www.ssc.wisc.edu/~bhansen/econometrics/invest.dat", sep="\s+", names=['a', 'b', 'c', 'd'])
6. Read SAS File
We can import a SAS data file by using the read_sas() function.
mydata4 = pd.read_sas('cars.sas7bdat')
If you have a large SAS file, you can try the pyreadstat package, which is faster than pandas. It is equivalent to the haven package in R, which provides an easy and fast way to read data from SAS, SPSS and Stata. To install this package, you can use the command pip install pyreadstat
import pyreadstat
df, meta = pyreadstat.read_sas7bdat('cars.sas7bdat')
# done! let's see what we got
print(df.head())
print(meta.column_names)
print(meta.column_labels)
print(meta.number_rows)
print(meta.number_columns)
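pandas itself can also read a large SAS file in chunks instead of loading everything at once; a minimal sketch, assuming the same cars.sas7bdat file:
import pandas as pd
reader = pd.read_sas('cars.sas7bdat', chunksize=10000)   # returns an iterator of DataFrames
for chunk in reader:
    print(chunk.shape)                                   # process each chunk separately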
7. Read Stata File
We can load a Stata data file via the read_stata() function.
mydata41 = pd.read_stata('cars.dta')
The pyreadstat package lets you pull value labels from Stata files.
import pyreadstat
df, meta = pyreadstat.read_dta("cars.dta")
To get labels, set apply_value_formats to True:
df, meta = pyreadstat.read_dta("cars.dta", apply_value_formats=True)
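pandas' own read_stata() can apply value labels as well, through its convert_categoricals argument (True by default); a sketch with the same file:
mydata42 = pd.read_stata('cars.dta', convert_categoricals=True)   # labelled values come back as pandas categoricals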
8. Import R Data File
Using the pyreadr package, you can load .RData and .Rds format files, which in general contain R data frames. You can install this package using the command below -
pip install pyreadr
With the read_r() function, we can import R data format files.
import pyreadr
result = pyreadr.read_r('C:/Users/sampledata.RData')
print(result.keys()) # let's check what objects we got
df1 = result["df1"] # extract the pandas data frame for object df1
Similarly, you can read an .Rds formatted file.
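A sketch of that (the file name is hypothetical); pyreadr stores the single object from an .Rds file under the key None:
result = pyreadr.read_r('C:/Users/sampledata.Rds')   # hypothetical .Rds file
df2 = result[None]                                   # .Rds files contain one unnamed object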
9. Read SQL Table
We can extract a table from a SQL database (SQL Server / Teradata). See the programs below.
SQL Server
You can read data from tables stored in SQL Server by building a connection. You need the server name, User ID (UID), password and database details to establish the connection.
import pandas as pd
import pyodbc
conn = pyodbc.connect("Driver={SQL Server};Server=serverName;UID=UserName;PWD=Password;Database=RCO_DW;")
df = pd.read_sql_query('select * from dbo.Table WHERE ID > 10', conn)
df.head()
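To avoid hard-coding filter values into the SQL string, read_sql_query() also accepts a params argument (with pyodbc the placeholder is a question mark); a sketch against the same connection:
df = pd.read_sql_query('select * from dbo.Table WHERE ID > ?', conn, params=[10])   # 10 is bound to the ? placeholder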
Teradata
You need to import the teradata module, which integrates Python with the Teradata database.
import pandas as pd
import teradata
udaExec = teradata.UdaExec(appName="HelloWorld", version="1.0", logConsole=False)
session = udaExec.connect(method="odbc", USEREGIONALSETTINGS="N", system="tdprod", username="xxx", password="xxx")
query = "SELECT * FROM flight"
df = pd.read_sql(query, session)
Explanation
- UdaExec provides DevOps support features such as configuration and logging.
- You can assign any name and version in appName and version.
- logConsole=False tells Python not to log to the console.
- system="tdprod" refers to the name of the system we are connecting to, using ODBC as the connection method.
- USEREGIONALSETTINGS="N" is used to ensure that float values can be loaded and that the decimal separator is '.'.
SQLite
Suppose you have a .db extension file, which is a SQLite database file, and you want to extract data from it.
import sqlite3
import pandas as pd
conn = sqlite3.connect('C:/Users/Deepanshu/Downloads/flight.db')
query = "SELECT * FROM flight"
results = pd.read_sql(query, con=conn)
print(results.head())
10. Import Data from SPSS File
import pyreadstat
df, meta = pyreadstat.read_sav("file.sav", apply_value_formats=True)
If you don't want value labels, set apply_value_formats to False.
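Recent pandas versions also provide a read_spss() wrapper, which uses pyreadstat under the hood; a sketch with the same file name:
import pandas as pd
df = pd.read_spss("file.sav", convert_categoricals=True)   # convert_categoricals applies the value labels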
11. Read sample of rows and columns
By specifying nrows= and usecols=, you can fetch a specified number of rows and columns.
mydata7 = pd.read_csv("http://winterolympicsmedals.com/medals.csv", nrows=5, usecols=(1,5,7))
nrows=5 implies you want to import only the first 5 rows, and usecols= refers to the positions of the columns you want to import.
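usecols= also accepts column names instead of positions; a sketch where the file name and column names are hypothetical placeholders for the actual header names in your file:
mydata7b = pd.read_csv("workingfile.csv", nrows=5, usecols=['ID', 'first_name', 'salary'])   # hypothetical column names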
12. Skip rows while importing
Suppose you want to skip the first 5 rows and read data from the 6th row (the 6th row would be the header row).
mydata8 = pd.read_csv("http://winterolympicsmedals.com/medals.csv", skiprows=5)
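skiprows= can also take a list of row indices when you want to skip specific rows rather than a block at the top:
mydata8b = pd.read_csv("http://winterolympicsmedals.com/medals.csv", skiprows=[1, 3, 5])   # keeps row 0 as the header, skips the 2nd, 4th and 6th lines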
13. Specify values as missing values
By including the na_values= option, you can specify which values should be treated as missing values. In this case, we are telling Python to consider a dot (.) as a missing value.
mydata9 = pd.read_csv("workingfile.csv", na_values=['.'])
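na_values= also accepts several strings at once, or a dict that limits the rule to particular columns (the column name here is hypothetical):
mydata10 = pd.read_csv("workingfile.csv", na_values={'salary': ['.', 'NA', '']})   # only the 'salary' column gets these extra missing-value codes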