How to Import Data into Python

This tutorial explains the various methods to read data in Python including popular formats such as CSV, Text, Excel, SQL, SAS, Stata, and R Data. Loading data into the Python environment is the first step in any data analysis project.

We will use the pandas package to import data into Python. If it's not installed, run
pip install pandas in the command prompt. To load it, use import pandas as pd in your code.

Import Data into Python

Table of Contents

1. Import CSV files

To import a CSV file into Python, we can use the read_csv( ) function from the pandas package. It is important to note that a singlebackslash does not work when specifying the file path. You need to either change it to forward slash or add one more backslash like below.


import pandas as pd
mydata= pd.read_csv("C:\\Users\\Deepanshu\\Documents\\file1.csv")

If no header (title) in raw data file


mydata1 = pd.read_csv("C:\\Users\\Deepanshu\\Documents\\file1.csv", header = None)

You need to include header = None option to tell Python there is no column name (header) in data.

Add Column Names

We can include column names by using names= option.


mydata2 = pd.read_csv("C:\\Users\\Deepanshu\\Documents\\file1.csv", header = None, names = ['ID', 'first_name', 'salary'])

The variable names can also be added separately by using the following command.

mydata1.columns = ['ID', 'first_name', 'salary']

Import File from URL

You don't need to perform additional steps to fetch data from URL. Simply put URL in the read_csv() function (applicable only for CSV files stored in URL).

mydata = pd.read_csv("http://winterolympicsmedals.com/medals.csv")

Read sample of rows and columns

By specifying nrows= and usecols=, you can fetch specified number of rows and columns.


mydata7 = pd.read_csv("http://winterolympicsmedals.com/medals.csv", nrows=5, usecols=(1,5,7))

nrows = 5 implies you want to import only first 5 rows and usecols= refers to specified columns you want to import.

Skip rows while importing

Suppose you want to skip first 5 rows and wants to read data from 6th row (6th row would be a header row)


mydata8 = pd.read_csv("http://winterolympicsmedals.com/medals.csv", skiprows=5)

Specify values as missing values

By including na_values= option, you can specify values as missing values. In this case, we are telling python to consider dot (.) as missing cases.

mydata9 = pd.read_csv("workingfile.csv", na_values=['.'])

Detailed Explanation : Import CSV File in Python

2. Read Text File

We can use read_table() function to pull data from text file. We can also use read_csv() with sep= "\t" to read data from tab-separated file.


import pandas as pd  
mydata  = pd.read_table("C:\\Users\\Deepanshu\\Desktop\\example2.txt")
mydata = pd.read_csv("C:\\Users\\Deepanshu\\Desktop\\example2.txt", sep ="\t")

3. Read Excel File

The read_excel() function can be used to import excel data into Python.


import pandas as pd  
mydata = pd.read_excel("https://www.eia.gov/dnav/pet/hist_xls/RBRTEd.xls",sheetname="Data 1", skiprows=2)

If you do not specify name of sheet in sheetname= option, it would take by default first sheet. The skiprows= option tells python to skip first N number of rows while importing data.

4. Read Delimited File

Suppose you need to import a file that is separated with white spaces.


import pandas as pd  
mydata2 = pd.read_table("http://www.ssc.wisc.edu/~bhansen/econometrics/invest.dat", sep="\s+", header = None)

To include variable names, use the names= option like below -


mydata3 = pd.read_table("http://www.ssc.wisc.edu/~bhansen/econometrics/invest.dat", sep="\s+", names=['a', 'b', 'c', 'd'])

5. Read SAS File

We can import SAS data file by using the read_sas() function from the pandas package.


import pandas as pd  
mydata4 = pd.read_sas('cars.sas7bdat')

If you have a large SAS File, you can try package named pyreadstat which is faster than pandas. It is equivalent to haven package in R which provides easy and fast way to read data from SAS, SPSS and Stata. To install this package, you can use the command pip install pyreadstat


import pyreadstat
df, meta = pyreadstat.read_sas7bdat('cars.sas7bdat')

# let's see what we got
print(df.head())
print(meta.column_names)
print(meta.column_labels)
print(meta.number_rows)
print(meta.number_columns)

6. Read Stata File

We can load Stata data file via read_stata() function.


import pandas as pd  
mydata41 = pd.read_stata('cars.dta')

pyreadstat package lets you to pull value labels from stata files.


import pyreadstat
df, meta = pyreadstat.read_dta("cars.dta")

To get labels, set apply_value_formats as TRUE


df, meta = pyreadstat.read_dta("cars.dta", apply_value_formats=True)

7. Import R Datafile

Using pyreadr package, you can load .RData and .Rds format files which in general contains R data frame. You can install this package using the command below -

pip install pyreadr

With the use of read_r( ) function, we can import R data format files.


import pyreadr
result = pyreadr.read_r('C:/Users/sampledata.RData')
print(result.keys()) # let's check what objects we got
df1 = result["df1"] # extract the pandas data frame for object df1

Similarly, you can read .Rds formatted file.

8. Import SQL Table

We can extract table from SQL database (SQL Server / Teradata). See the program below -

SQL Server

You can read data from tables stored in SQL Server by building a connection. You need to have server, User ID (UID), database details to establish connection.


import pandas as pd
import pyodbc 
conn = pyodbc.connect("Driver={SQL Server};Server=serverName;UID=UserName;PWD=Password;Database=RCO_DW;")
df = pd.read_sql_query('select * from dbo.Table WHERE ID > 10', conn)
df.head()

Teradata

You need to import Teradata module which makes python easily integrated with Teradata Database.


import pandas as pd
import teradata
udaExec = teradata.UdaExec(appName="HelloWorld", version="1.0",
                           logConsole=False)
session = udaExec.connect(method="odbc",
                          USEREGIONALSETTINGS="N",
                          system="tdprod",
                          username="xxx",
                          password="xxx");

query = "SELECT * FROM flight"
df = pd.read_sql(query , session)

Explanation

UdaExec provides DevOps support features such as configuration and logging. You can assign any name and version in appName and version
logConsole=False tells Python not to log to the console.
system="tdprod" refers to name of the system we are connecting using ODBC as the connection method
USEREGIONALSETTINGS="N" is used to ensure that float values can be loaded and make decimal separator be ‘.’

9. Import SPSS File


import pyreadstat
df, meta = pyreadstat.read_sav("file.sav", apply_value_formats=True)

If you don't want value labels, make apply_value_formats as False.

About Author:
Deepanshu Bhalla

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 10 years of experience in data science. During his tenure, he worked with global clients in various domains like Banking, Insurance, Private Equity, Telecom and HR.

While I love having friends who agree, I only learn from those who don't
Let's Get Connected Email LinkedIn