This article explores efficient ways to export large datasets to CSV files using PostgreSQL's COPY function. The traditional approach based on Ruby's CSV library is compared with exporting directly from PostgreSQL.
CSV is a simple and widely used file format for storing tabular data, such as spreadsheets or databases. In a CSV file, each line represents a row of data, and the values within a row are separated by commas (or other specified delimiters, such as semicolons or tabs). Importing and exporting data in CSV format is therefore a very common requirement.
Regular Ruby/Rails exports
In Ruby, you can use the built-in CSV library to export data to a CSV file. Here's a simple example:
require 'csv'
headers = ['Name', 'Age', 'Location']
data = [
  ['John Doe', 25, 'New York'],
  ['Jane Smith', 30, 'San Francisco'],
  ['Bob Johnson', 22, 'Chicago']
]
# or for ex. data = Person.all.pluck(:name, :age, :location)
csv_file_path = 'example.csv'
CSV.open(csv_file_path, 'w', write_headers: true, headers: headers) do |csv|
  data.each do |row|
    csv << row
  end
end
puts "CSV file has been created at #{csv_file_path}"
It looks pretty nice and simple - we can easily replace our sample data with the results of SQL selects, map them into separate rows and voilà - our export is ready. This is a popular, safe approach widely used in Ruby applications.
For large datasets, however, this traditional method doesn't scale well. A more memory-efficient approach, such as streaming the records in batches (for example with find_each) instead of loading everything at once, can be considered. However, for thousands of records or more, this can still be time-consuming. A logical step would be to move the export to a background worker like Sidekiq to address performance issues and reduce headaches. Additionally, gems like smarter_csv, which support parallel processing, lazy loading and built-in Sidekiq jobs, could be considered. However, these solutions require more and more resources to operate efficiently.
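To illustrate what batching means here, below is a minimal sketch of a more memory-friendly variant of the plain Ruby export; the Person model and its columns are just placeholders carried over from the earlier snippet:

require 'csv'

headers = ['Name', 'Age', 'Location']

CSV.open('example.csv', 'w', write_headers: true, headers: headers) do |csv|
  # find_each walks the table in batches of 1,000 records by default,
  # so memory usage stays roughly constant regardless of table size
  Person.select(:id, :name, :age, :location).find_each do |person|
    csv << [person.name, person.age, person.location]
  end
end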
What is the COPY function?
I will explore a slightly different approach offered by PostgreSQL. Moving some logic into the database layer could significantly improve performance. Let's examine this closely.
According to the PostgreSQL documentation:
COPY moves data between PostgreSQL tables and standard file-system files. COPY TO copies the contents of a table to a file, while COPY FROM copies data from a file to a table (appending the data to whatever is in the table already). COPY TO can also copy the results of a SELECT query.
Here we want to make use of COPY TO. An example of this is:
COPY (table/SELECT query) TO '/example.csv';
There is a significant issue - the file is saved on the database server. This can be problematic unless you have a dedicated server that houses both the database and file storage, which is not very common today. Alternatively, if you wish to save it client-side, you can use the psql \copy command. This command reads the data that Postgres writes to STDOUT on the remote server and stores it in a file on the client side. However, it's not very convenient to call the psql client from an application. Let's examine what the \copy command actually does:
\copy invokes COPY FROM STDIN or COPY TO STDOUT, and then fetches/stores the data in a file accessible to the psql client
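For reference, running such an export straight from the psql client could look roughly like this (the file name and query are only placeholders):

\copy (SELECT * FROM users ORDER BY id) TO 'users.csv' WITH CSV HEADER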
So maybe we can make use of this mechanism and manage COPY TO STDOUT ourselves, with some help from ActiveRecord and the pg gem?
COPY vs CSV usage
Imagine we have a large set of users (table structure below) that needs to be exported. To clarify - I am using a table with around 1,500,000 records:
create_table :users do |t|
  t.string :email, null: false, index: { unique: true }
  t.string :first_name
  t.string :last_name
  t.string :username, null: false, index: { unique: true }
  t.integer :status, default: 0
  t.jsonb :settings, default: {}
  t.timestamps
end
Let's consider the following piece of code with two methods - one uses PostgreSQL's COPY, the other the plain Ruby CSV library:
class User < ApplicationRecord
  enum status: { active: 0, archived: 1, blocked: 2, inactive: 3, part_active: 4, disabled: 5 }

  def self.copy_export
    output_file_path = 'output.csv'
    begin
      output_file = File.open(output_file_path, 'w')
      query = <<-SQL
        COPY (SELECT * FROM users ORDER BY id LIMIT 300000)
        TO STDOUT WITH CSV HEADER
      SQL
      ActiveRecord::Base.connection.raw_connection.copy_data(query) do
        while (row = ActiveRecord::Base.connection.raw_connection.get_copy_data)
          output_file.write(row)
        end
      end
    rescue StandardError => e
      puts "Error exporting data: #{e.message}"
    ensure
      output_file&.close
    end
  end

  def self.csv_export
    output_file_path = 'outfile.csv'
    result = User.all.order(:id).limit(300000)
    headers = User.column_names
    CSV.open(output_file_path, 'w', write_headers: true, headers: headers) do |csv|
      result.each do |row|
        csv << row.attributes.values
      end
    end
  end
end
Let's analyze the copy_export method. First, we set the file path and open the file inside the begin block. We then use ActiveRecord::Base.connection.raw_connection to drop down to the pg gem layer. This allows us to call the [copy_data](https://www.rubydoc.info/gems/pg/PG%2FConnection:copy_data) method, which in turn calls the COPY function with its options - a SELECT query, CSV format and HEADER (the column names as CSV file headers). We then iterate over the rows returned to STDOUT with get_copy_data and write them to our file. The remaining code is a general rescue from errors and ensures that the file will be closed.
Next, I'll check the results and use the simple Ruby Benchmark library to measure the speed of the copy_export method. I'll use the query SELECT * FROM users ORDER BY id LIMIT 300000, which retrieves the first 300,000 rows from the users table.
3.1.2 :052 > Benchmark.realtime { User.copy_export }
=> 1.4312629997730255
id,email,first_name,last_name,username,status,settings,created_at,updated_at
1,mindy.lockman@fritsch.test,Susannah,Moore,sade,5,{},2023-11-12 21:01:05.604871,2023-11-12 21:01:05.604871
2,elmer@willms-stokes.example,Ammie,Bradtke,alison.schinner,4,{},2023-11-12 21:01:05.620698,2023-11-12 21:01:05.620698
3,susie@witting.test,Corinna,Kautzer,laree.walsh,0,{},2023-11-12 21:01:05.656736,2023-11-12 21:01:05.656736
4,quinton@becker.example,Titus,Crist,courtney,2,{},2023-11-12 21:01:05.663168,2023-11-12 21:01:05.663168
5,mariette_hauck@okon-howe.example,Perla,Ward,lindsay,2,{},2023-11-12 21:01:05.6688,2023-11-12 21:01:05.6688
6,dinah.collier@mertz-fritsch.test,Elvira,Hahn,cole_marvin,4,{},2023-11-12 21:01:05.673525,2023-11-12 21:01:05.673525
...
299997,lakita.cole@christiansen.example40420qa,Milford40420ad,Feil40420xa,calvin40420ad,0,{},2023-11-12 21:41:07.756856,2023-11-12 21:41:07.756856
299998,korey_schroeder@treutel-witting.example40421qa,Ezequiel40421ad,Rowe40421xa,ramiro40421ad,3,{},2023-11-12 21:41:07.760369,2023-11-12 21:41:07.760369
299999,roscoe@effertz-breitenberg.test40422qa,Piedad40422ad,Wilkinson40422xa,joya40422ad,1,{},2023-11-12 21:41:07.764228,2023-11-12 21:41:07.764228
300000,giuseppe.dickens@rutherford-green.example40423qa,Jared40423ad,Mohr40423xa,lilliana.daniel40423ad,5,{},2023-11-12 21:41:07.768311,2023-11-12 21:41:07.768311
The second method, csv_export, is straightforward. It defines the file path, retrieves the first 300,000 rows from the users table, obtains the header names and, row by row, populates the CSV file. Let's review the results and benchmark for this method.
3.1.2 :059 > Benchmark.realtime { User.csv_export }
=> 22.258816000074148
id,email,first_name,last_name,username,status,settings,created_at,updated_at
1,mindy.lockman@fritsch.test,Susannah,Moore,sade,disabled,{},2023-11-12 21:01:05 UTC,2023-11-12 21:01:05 UTC
2,elmer@willms-stokes.example,Ammie,Bradtke,alison.schinner,part_active,{},2023-11-12 21:01:05 UTC,2023-11-12 21:01:05 UTC
3,susie@witting.test,Corinna,Kautzer,laree.walsh,active,{},2023-11-12 21:01:05 UTC,2023-11-12 21:01:05 UTC
4,quinton@becker.example,Titus,Crist,courtney,blocked,{},2023-11-12 21:01:05 UTC,2023-11-12 21:01:05 UTC
5,mariette_hauck@okon-howe.example,Perla,Ward,lindsay,blocked,{},2023-11-12 21:01:05 UTC,2023-11-12 21:01:05 UTC
6,dinah.collier@mertz-fritsch.test,Elvira,Hahn,cole_marvin,part_active,{},2023-11-12 21:01:05 UTC,2023-11-12 21:01:05 UTC
...
299997,lakita.cole@christiansen.example40420qa,Milford40420ad,Feil40420xa,calvin40420ad,active,{},2023-11-12 21:41:07 UTC,2023-11-12 21:41:07 UTC
299998,korey_schroeder@treutel-witting.example40421qa,Ezequiel40421ad,Rowe40421xa,ramiro40421ad,inactive,{},2023-11-12 21:41:07 UTC,2023-11-12 21:41:07 UTC
299999,roscoe@effertz-breitenberg.test40422qa,Piedad40422ad,Wilkinson40422xa,joya40422ad,archived,{},2023-11-12 21:41:07 UTC,2023-11-12 21:41:07 UTC
300000,giuseppe.dickens@rutherford-green.example40423qa,Jared40423ad,Mohr40423xa,lilliana.daniel40423ad,disabled,{},2023-11-12 21:41:07 UTC,2023-11-12 21:41:07 UTC
It is clearly noticeable that using Postgres COPY is much quicker - in this example it is around 15 times faster than the Ruby library, so it is worth considering if we need to speed up CSV generation significantly.
Solving some problems
Things become complex when our application logic modifies data and the data retrieved directly from the database differs from what we see in the application layer. In this example there are three fields with noticeable differences - status, created_at and updated_at. For status we get the raw integer value of the enum instead of its string name, and for created_at and updated_at the format is different.
There are two ways to resolve these issues. The first method involves manipulating data at the database level. The second method involves modifying each row at the get_copy_data layer before saving it.
Database layer
At the database layer it would look like this - the query needs to return already transformed data:
def self.copy_export
  # the rest stays the same
  query = <<-SQL
    COPY (
      SELECT id, email, first_name, last_name, username,
        CASE status
          WHEN #{statuses[:active]} THEN 'active'
          WHEN #{statuses[:archived]} THEN 'archived'
          WHEN #{statuses[:blocked]} THEN 'blocked'
          WHEN #{statuses[:inactive]} THEN 'inactive'
          WHEN #{statuses[:part_active]} THEN 'part_active'
          WHEN #{statuses[:disabled]} THEN 'disabled'
          ELSE NULL
        END AS status,
        settings,
        TO_CHAR(created_at::timestamp AT TIME ZONE 'UTC', 'YYYY-MM-DD HH24:MI:SS UTC') AS created_at,
        TO_CHAR(updated_at::timestamp AT TIME ZONE 'UTC', 'YYYY-MM-DD HH24:MI:SS UTC') AS updated_at
      FROM users
      ORDER BY id
      LIMIT 300000
    ) TO STDOUT WITH CSV HEADER
  SQL
  # the rest stays the same
end
After that, I can run copy_export again:
3.1.2 :084 > Benchmark.realtime { User.copy_export }
=> 1.49278200045228
And here is our CSV file:
id,email,first_name,last_name,username,status,settings,created_at,updated_at
1,mindy.lockman@fritsch.test,Susannah,Moore,sade,disabled,{},2023-11-12 21:01:05 UTC,2023-11-12 21:01:05 UTC
...
300000,giuseppe.dickens@rutherford-green.example40423qa,Jared40423ad,Mohr40423xa,lilliana.daniel40423ad,disabled,{},2023-11-12 21:41:07 UTC,2023-11-12 21:41:07 UTC
Now, the CSV generated via the COPY method looks the same as the one generated by the Ruby library.
Rubyish solution
The second approach is to manipulate the already generated rows - this is possible because get_copy_data returns each row as a Ruby string (it may also be returned as an array - check the CopyRow decoder). Here is a rough implementation of operating on the data in Ruby:
def self.copy_export
  output_file_path = 'output.csv'
  begin
    output_file = File.open(output_file_path, 'w')
    i = 0
    ActiveRecord::Base.connection.raw_connection.copy_data("COPY (SELECT * FROM users ORDER BY id LIMIT 300000) TO STDOUT WITH CSV HEADER") do
      while (row = ActiveRecord::Base.connection.raw_connection.get_copy_data)
        if i > 0
          # naive split - assumes none of the exported values contain embedded commas
          values = row.split(",")
          # map the enum integer back to its name and reformat both timestamps
          values[5] = User.statuses.key(values[5].to_i)
          values[7] = DateTime.parse(values[7]).strftime("%Y-%m-%d %H:%M:%S UTC")
          values[8] = DateTime.parse(values[8]).strftime("%Y-%m-%d %H:%M:%S UTC")
          output_file.puts(values.join(','))
        else
          # the first row returned is the CSV header - write it unchanged
          output_file.write(row)
        end
        i += 1
      end
    end
  rescue StandardError => e
    puts "Error exporting data: #{e.message}"
  ensure
    output_file&.close
  end
end
In short, I skip the header row and split each row string into an array so that I can manipulate the values I want to change.
Again, I run copy_export one more time:
3.1.2 :232 > Benchmark.realtime { User.copy_export }
=> 5.370280999690294
Noticeably, generating the CSV file got slower because of the direct manipulation of its data in Ruby - but it is still much faster than using the Ruby library itself.
id,email,first_name,last_name,username,status,settings,created_at,updated_at
1,mindy.lockman@fritsch.test,Susannah,Moore,sade,disabled,{},2023-11-12 21:01:05 UTC,2023-11-12 21:01:05 UTC
...
300000,giuseppe.dickens@rutherford-green.example40423qa,Jared40423ad,Mohr40423xa,lilliana.daniel40423ad,disabled,{},2023-11-12 21:41:07 UTC,2023-11-12 21:41:07 UTC
Once again, we obtained desirable results that are more readable than the direct output from the database.
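As mentioned earlier, get_copy_data can also hand each row back as an array of strings when a PG::TextDecoder::CopyRow decoder is passed to copy_data. A minimal sketch of that variant is below; note that it uses the plain COPY text format rather than CSV and re-serializes the rows with Ruby's CSV library, so treat it as an illustration rather than a drop-in replacement:

def self.copy_export_with_decoder
  decoder = PG::TextDecoder::CopyRow.new
  conn = ActiveRecord::Base.connection.raw_connection
  CSV.open('output.csv', 'w') do |csv|
    csv << User.column_names
    conn.copy_data("COPY (SELECT * FROM users ORDER BY id LIMIT 300000) TO STDOUT", decoder) do
      # with the decoder in place, each row arrives as an array of strings
      while (values = conn.get_copy_data)
        values[5] = User.statuses.key(values[5].to_i)
        csv << values
      end
    end
  end
end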
What’s more?
There are a few additional points worth mentioning that may be useful when using this solution. It should be noted that dumping a whole table into CSV is a relatively simple example, and usually we need something more.
Data from multiple tables
In this case, all that needs to be done is to create a proper SELECT query with any type of JOIN statement and swap it in for the SELECT used in the copy_export method.
SELECT orders.id AS order_id, orders.amount AS amount, users.id AS user_id FROM orders INNER JOIN users ON orders.user_id = users.id ORDER BY orders.id LIMIT 10
order_id,amount,user_id
1,10.00,374251
2,20.00,371735
3,5.00,58423
4,1.00,49674
5,2.00,398807
6,4.00,12144
7,4.00,21743
8,2.00,260913
9,1.00,191346
10,10.00,131893
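Plugged into the copy_export method, the heredoc from the earlier example would simply wrap this query - nothing else changes:

query = <<-SQL
  COPY (
    SELECT orders.id AS order_id, orders.amount AS amount, users.id AS user_id
    FROM orders
    INNER JOIN users ON orders.user_id = users.id
    ORDER BY orders.id
    LIMIT 10
  ) TO STDOUT WITH CSV HEADER
SQL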
Here we've smoothly gathered data from both the orders and users tables.
Renaming headers
This one is also pretty easy and is already shown in the SELECT query above. We can choose the proper headers simply by aliasing columns in the query - orders.id to order_id, orders.amount to amount and users.id to user_id.
Additional columns
There may be cases where a CSV column doesn't correspond directly to our table data. For example, it might be the sum of two other columns, a doubled value of another column, or any other kind of derived (or constant) value:
SELECT *, 2*orders.amount AS doubled_amount FROM orders
id,user_id,amount,created_at,updated_at,doubled_amount
1,374251,1000.00,2023-11-12 22:51:55.590875,2023-12-28 17:09:52.801258,2000.00
2,371735,100000.00,2023-11-12 22:51:55.67145,2023-12-28 17:10:51.519108,200000.00
3,58423,10.00,2023-11-12 22:51:55.681761,2023-11-12 22:51:55.681761,20.00
4,49674,20.00,2023-11-12 22:51:55.690487,2023-11-12 22:51:55.690487,40.00
5,398807,30.00,2023-11-12 22:51:55.696318,2023-11-12 22:51:55.696318,60.00
6,12144,10.00,2023-11-12 22:51:55.701053,2023-11-12 22:51:55.701053,20.00
7,21743,20.00,2023-11-12 22:51:55.707059,2023-11-12 22:51:55.707059,40.00
8,260913,30.00,2023-11-12 22:51:55.713588,2023-11-12 22:51:55.713588,60.00
9,191346,10.00,2023-11-12 22:51:55.720107,2023-11-12 22:51:55.720107,20.00
10,131893,10.00,2023-11-12 22:51:55.725918,2023-11-12 22:51:55.725918,20.00
Some useful resources
There's already an existing gem called postgres-copy. This gem allows your ActiveRecord models to use the PostgreSQL COPY command to import/export data in CSV format. It's similar to the solution presented in this article and is built on top of the pg gem methods. It's definitely worth checking out.
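For reference, basic usage along the lines of the gem's README looks roughly like this (acts_as_copy_target and copy_to are taken from that README, so double-check them against the version you install):

class User < ApplicationRecord
  acts_as_copy_target
end

# exports the table straight to a local CSV file, using COPY under the hood
User.copy_to('users.csv')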
Another gem related to the COPY function is activerecord-copy. However, this gem only handles copying from CSV to the table, so it doesn't fully address the topic of this article.
Final words
According to my analysis, which is included in the article, the COPY TO
method demonstrates remarkable performance advantages, as evidenced by a significant speed difference in benchmarking. However, its straightforward nature may pose challenges when dealing with complex application logic and data modifications. These challenges need to be addressed using advanced SQL queries or direct data manipulations. On the other hand, the traditional CSV library, though slower, offers a more adaptable solution for complex data manipulations and logic transformations. The choice between the two methods ultimately hinges on the specific requirements of the application, with the COPY TO
solution excelling in raw export speed and the CSV library offering versatility in handling complex data scenarios. I highly recommend giving PostgreSQL solution a try, especially if your dataset is large but not overly complicated, or you just enjoy serving some efficient SQL.