This article explores efficient ways to export large datasets to CSV files using PostgreSQL's COPY function. The traditional approach based on Ruby's CSV library is compared with exporting directly from PostgreSQL.
CSV is a simple and widely used file format for storing tabular data, such as spreadsheets or databases. In a CSV file, each line represents a row of data, and the values within a row are separated by commas (or other specified delimiters, such as semicolons or tabs). Importing and exporting data in CSV format is therefore a very common requirement.
Regular Ruby/Rails exports
In Ruby, you can use the built-in CSV library to export data to a CSV file. Here's a simple example:
require 'csv'
headers = ['Name', 'Age', 'Location']
data = [
  ['John Doe', 25, 'New York'],
  ['Jane Smith', 30, 'San Francisco'],
  ['Bob Johnson', 22, 'Chicago']
]
# or for ex. data = Person.all.pluck(:name, :age, :location)
csv_file_path = 'example.csv'
CSV.open(csv_file_path, 'w', write_headers: true, headers: headers) do |csv|
  data.each do |row|
    csv << row
  end
end
puts "CSV file has been created at #{csv_file_path}"
It looks pretty nice and simple - we can easily replace our sample data with the results of SQL selects, map them into separate rows and voilà - our export is ready. This is a popular, safe approach widely used in Ruby applications.
For large datasets, however, this traditional method doesn't scale well. A more memory-efficient approach, such as streaming the records in batches (for example with find_each) instead of loading everything at once, can be considered. However, for thousands of records or more, this can still be time-consuming. A logical step would be to move the export to a background worker like Sidekiq to address performance issues and reduce headaches. Additionally, gems like smarter_csv, which support parallel processing, lazy loading and built-in Sidekiq jobs, could be considered. However, these solutions require more and more resources to operate efficiently.
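To illustrate what batching means here, below is a minimal sketch of a more memory-friendly variant of the plain Ruby export; the Person model and its columns are just placeholders carried over from the earlier snippet:

require 'csv'

headers = ['Name', 'Age', 'Location']

CSV.open('example.csv', 'w', write_headers: true, headers: headers) do |csv|
  # find_each walks the table in batches of 1,000 records by default,
  # so memory usage stays roughly constant regardless of table size
  Person.select(:id, :name, :age, :location).find_each do |person|
    csv << [person.name, person.age, person.location]
  end
end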
What is the COPY function?
I will explore a slightly different approach offered by PostgreSQL. Moving some logic into the database layer could significantly improve performance. Let's examine this closely.
According to the PostgreSQL documentation:
COPY moves data between PostgreSQL tables and standard file-system files. COPY TO copies the contents of a table to a file, while COPY FROM copies data from a file to a table (appending the data to whatever is in the table already). COPY TO can also copy the results of a SELECT query.
Here we want to make use of COPY TO. An example of this is:
COPY (table/SELECT query) TO '/example.csv';
There is a significant issue - the file is saved on the database server. This can be problematic unless you have a dedicated server that houses both the database and file storage, which is not very common today. Alternatively, if you wish to save it client-side, you can use the psql \copy command. This command reads the data that Postgres writes to STDOUT on the remote server and stores it in a file on the client side. However, it's not very convenient to call the psql client from an application. Let's examine what the \copy command actually does:
\copy invokes COPY FROM STDIN or COPY TO STDOUT, and then fetches/stores the data in a file accessible to the psql client
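For reference, running such an export straight from the psql client could look roughly like this (the file name and query are only placeholders):

\copy (SELECT * FROM users ORDER BY id) TO 'users.csv' WITH CSV HEADER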
So maybe we can make use of this mechanism and manage COPY TO STDOUT ourselves, with some help from ActiveRecord and the pg gem?
COPY vs CSV usage
Imagine we have a large set of users (table structure below) that needs to be exported. To clarify - I am using a table with around 1,500,000 records:
create_table :users do |t|
  t.string :email, null: false, index: { unique: true }
  t.string :first_name
  t.string :last_name
  t.string :username, null: false, index: { unique: true }
  t.integer :status, default: 0
  t.jsonb :settings, default: {}
  t.timestamps
end
Let's consider the following piece of code with two methods - one uses PostgreSQL's COPY, the other the plain Ruby CSV library:
class User < ApplicationRecord
  enum status: { active: 0, archived: 1, blocked: 2, inactive: 3, part_active: 4, disabled: 5 }

  def self.copy_export
    output_file_path = 'output.csv'
    begin
      output_file = File.open(output_file_path, 'w')
      query = <<-SQL
        COPY (SELECT * FROM users ORDER BY id LIMIT 300000)
        TO STDOUT WITH CSV HEADER
      SQL
      ActiveRecord::Base.connection.raw_connection.copy_data(query) do
        while (row = ActiveRecord::Base.connection.raw_connection.get_copy_data)
          output_file.write(row)
        end
      end
    rescue StandardError => e
      puts "Error exporting data: #{e.message}"
    ensure
      output_file&.close
    end
  end

  def self.csv_export
    output_file_path = 'outfile.csv'
    result = User.all.order(:id).limit(300000)
    headers = User.column_names
    CSV.open(output_file_path, 'w', write_headers: true, headers: headers) do |csv|
      result.each do |row|
        csv << row.attributes.values
      end
    end
  end
end
Let's analyze the copy_export method. First, we set the file path and open the file inside the begin block. We then use ActiveRecord::Base.connection.raw_connection to drop down to the pg gem layer. This allows us to call the [copy_data](https://www.rubydoc.info/gems/pg/PG%2FConnection:copy_data) method, which in turn calls the COPY function with its options - a SELECT query, CSV format and HEADER (the column names as CSV file headers). We then iterate over the rows returned to STDOUT with get_copy_data and write them to our file. The remaining code is a general rescue from errors and ensures that the file will be closed.
Next, I'll check the results and use the simple Ruby Benchmark library to measure the speed of the copy_export method. I'll use the query SELECT * FROM users ORDER BY id LIMIT 300000, which retrieves the first 300,000 rows from the users table.
3.1.2 :052 > Benchmark.realtime { User.copy_export }
=> 1.4312629997730255
id,email,first_name,last_name,username,status,settings,created_at,updated_at
1,mindy.lockman@fritsch.test,Susannah,Moore,sade,5,{},2023-11-12 21:01:05.604871,2023-11-12 21:01:05.604871
2,elmer@willms-stokes.example,Ammie,Bradtke,alison.schinner,4,{},2023-11-12 21:01:05.620698,2023-11-12 21:01:05.620698
3,susie@witting.test,Corinna,Kautzer,laree.walsh,0,{},2023-11-12 21:01:05.656736,2023-11-12 21:01:05.656736
4,quinton@becker.example,Titus,Crist,courtney,2,{},2023-11-12 21:01:05.663168,2023-11-12 21:01:05.663168
5,mariette_hauck@okon-howe.example,Perla,Ward,lindsay,2,{},2023-11-12 21:01:05.6688,2023-11-12 21:01:05.6688
6,dinah.collier@mertz-fritsch.test,Elvira,Hahn,cole_marvin,4,{},2023-11-12 21:01:05.673525,2023-11-12 21:01:05.673525
...
299997,lakita.cole@christiansen.example40420qa,Milford40420ad,Feil40420xa,calvin40420ad,0,{},2023-11-12 21:41:07.756856,2023-11-12 21:41:07.756856
299998,korey_schroeder@treutel-witting.example40421qa,Ezequiel40421ad,Rowe40421xa,ramiro40421ad,3,{},2023-11-12 21:41:07.760369,2023-11-12 21:41:07.760369
299999,roscoe@effertz-breitenberg.test40422qa,Piedad40422ad,Wilkinson40422xa,joya40422ad,1,{},2023-11-12 21:41:07.764228,2023-11-12 21:41:07.764228
300000,giuseppe.dickens@rutherford-green.example40423qa,Jared40423ad,Mohr40423xa,lilliana.daniel40423ad,5,{},2023-11-12 21:41:07.768311,2023-11-12 21:41:07.768311
The second method, csv_export, is straightforward. It defines the file path, retrieves the first 300,000 rows from the users table, obtains the header names and, row by row, populates the CSV file. Let's review the results and benchmark for this method.
3.1.2 :059 > Benchmark.realtime { User.csv_export }
=> 22.258816000074148
id,email,first_name,last_name,username,status,settings,created_at,updated_at
1,mindy.lockman@fritsch.test,Susannah,Moore,sade,disabled,{},2023-11-12 21:01:05 UTC,2023-11-12 21:01:05 UTC
2,elmer@willms-stokes.example,Ammie,Bradtke,alison.schinner,part_active,{},2023-11-12 21:01:05 UTC,2023-11-12 21:01:05 UTC
3,susie@witting.test,Corinna,Kautzer,laree.walsh,active,{},2023-11-12 21:01:05 UTC,2023-11-12 21:01:05 UTC
4,quinton@becker.example,Titus,Crist,courtney,blocked,{},2023-11-12 21:01:05 UTC,2023-11-12 21:01:05 UTC
5,mariette_hauck@okon-howe.example,Perla,Ward,lindsay,blocked,{},2023-11-12 21:01:05 UTC,2023-11-12 21:01:05 UTC
6,dinah.collier@mertz-fritsch.test,Elvira,Hahn,cole_marvin,part_active,{},2023-11-12 21:01:05 UTC,2023-11-12 21:01:05 UTC
...
299997,lakita.cole@christiansen.example40420qa,Milford40420ad,Feil40420xa,calvin40420ad,active,{},2023-11-12 21:41:07 UTC,2023-11-12 21:41:07 UTC
299998,korey_schroeder@treutel-witting.example40421qa,Ezequiel40421ad,Rowe40421xa,ramiro40421ad,inactive,{},2023-11-12 21:41:07 UTC,2023-11-12 21:41:07 UTC
299999,roscoe@effertz-breitenberg.test40422qa,Piedad40422ad,Wilkinson40422xa,joya40422ad,archived,{},2023-11-12 21:41:07 UTC,2023-11-12 21:41:07 UTC
300000,giuseppe.dickens@rutherford-green.example40423qa,Jared40423ad,Mohr40423xa,lilliana.daniel40423ad,disabled,{},2023-11-12 21:41:07 UTC,2023-11-12 21:41:07 UTC
It is clearly noticeable that using Postgres COPY is much quicker - in this example it is around 15 times faster than the Ruby library, so it is worth considering if we need to speed up CSV generation significantly.
Solving some problems
Things become complex when our application logic modifies data and the data retrieved directly from the database differs from what we see in the application layer. In this example there are three fields with noticeable differences - status, created_at and updated_at. For status we get the raw integer value of the enum instead of its string name, and for created_at and updated_at the format is different.
There are two ways to resolve these issues. The first method involves manipulating data at the database level. The second method involves modifying each row at the get_copy_data layer before saving it.
Database layer
At the database layer it would look like this - the query needs to return already transformed data:
def self.copy_export
  # the rest stays the same
  query = <<-SQL
    COPY (
      SELECT id, email, first_name, last_name, username,
        CASE status
          WHEN #{statuses[:active]} THEN 'active'
          WHEN #{statuses[:archived]} THEN 'archived'
          WHEN #{statuses[:blocked]} THEN 'blocked'
          WHEN #{statuses[:inactive]} THEN 'inactive'
          WHEN #{statuses[:part_active]} THEN 'part_active'
          WHEN #{statuses[:disabled]} THEN 'disabled'
          ELSE NULL
        END AS status,
        settings,
        TO_CHAR(created_at::timestamp AT TIME ZONE 'UTC', 'YYYY-MM-DD HH24:MI:SS UTC') AS created_at,
        TO_CHAR(updated_at::timestamp AT TIME ZONE 'UTC', 'YYYY-MM-DD HH24:MI:SS UTC') AS updated_at
      FROM users
      ORDER BY id
      LIMIT 300000
    ) TO STDOUT WITH CSV HEADER
  SQL
  # the rest stays the same
end
After that, I can run copy_export again:
3.1.2 :084 > Benchmark.realtime { User.copy_export }
=> 1.49278200045228
And here is our CSV file:
id,email,first_name,last_name,username,status,settings,created_at,updated_at
1,mindy.lockman@fritsch.test,Susannah,Moore,sade,disabled,{},2023-11-12 21:01:05 UTC,2023-11-12 21:01:05 UTC
...
300000,giuseppe.dickens@rutherford-green.example40423qa,Jared40423ad,Mohr40423xa,lilliana.daniel40423ad,disabled,{},2023-11-12 21:41:07 UTC,2023-11-12 21:41:07 UTC
Now, the CSV generated via the COPY method looks the same as the one generated by the Ruby library.
Rubyish solution
The second approach is to manipulate the already generated rows - this is possible because get_copy_data returns each row as a Ruby string (it may also be returned as an array - check the CopyRow decoder). Here is a rough implementation of operating on the data in Ruby:
def self.copy_export
  output_file_path = 'output.csv'
  begin
    output_file = File.open(output_file_path, 'w')
    i = 0
    ActiveRecord::Base.connection.raw_connection.copy_data("COPY (SELECT * FROM users ORDER BY id LIMIT 300000) TO STDOUT WITH CSV HEADER") do
      while (row = ActiveRecord::Base.connection.raw_connection.get_copy_data)
        if i > 0
          # naive split - assumes none of the exported values contain embedded commas
          values = row.split(",")
          # map the enum integer back to its name and reformat both timestamps
          values[5] = User.statuses.key(values[5].to_i)
          values[7] = DateTime.parse(values[7]).strftime("%Y-%m-%d %H:%M:%S UTC")
          values[8] = DateTime.parse(values[8]).strftime("%Y-%m-%d %H:%M:%S UTC")
          output_file.puts(values.join(','))
        else
          # the first row returned is the CSV header - write it unchanged
          output_file.write(row)
        end
        i += 1
      end
    end
  rescue StandardError => e
    puts "Error exporting data: #{e.message}"
  ensure
    output_file&.close
  end
end
In short, I skip the header row and split each row string into an array so that I can manipulate the values I want to change.
Again, I run copy_export one more time:
3.1.2 :232 > Benchmark.realtime { User.copy_export }
=> 5.370280999690294
Noticeably, generating the CSV file got slower because of the direct manipulation of its data in Ruby - but it is still much faster than using the Ruby library itself.
id,email,first_name,last_name,username,status,settings,created_at,updated_at
1,mindy.lockman@fritsch.test,Susannah,Moore,sade,disabled,{},2023-11-12 21:01:05 UTC,2023-11-12 21:01:05 UTC
...
300000,giuseppe.dickens@rutherford-green.example40423qa,Jared40423ad,Mohr40423xa,lilliana.daniel40423ad,disabled,{},2023-11-12 21:41:07 UTC,2023-11-12 21:41:07 UTC
Once again, we obtained desirable results that are more readable than the direct output from the database.
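As mentioned earlier, get_copy_data can also hand each row back as an array of strings when a PG::TextDecoder::CopyRow decoder is passed to copy_data. A minimal sketch of that variant is below; note that it uses the plain COPY text format rather than CSV and re-serializes the rows with Ruby's CSV library, so treat it as an illustration rather than a drop-in replacement:

def self.copy_export_with_decoder
  decoder = PG::TextDecoder::CopyRow.new
  conn = ActiveRecord::Base.connection.raw_connection
  CSV.open('output.csv', 'w') do |csv|
    csv << User.column_names
    conn.copy_data("COPY (SELECT * FROM users ORDER BY id LIMIT 300000) TO STDOUT", decoder) do
      # with the decoder in place, each row arrives as an array of strings
      while (values = conn.get_copy_data)
        values[5] = User.statuses.key(values[5].to_i)
        csv << values
      end
    end
  end
end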
What’s more?
There are a few additional points worth mentioning that may be useful when using this solution. It should be noted that dumping a whole table into CSV is a relatively simple example, and usually we need something more.
Data from multiple tables
In this case, all that needs to be done is to create a proper SELECT query with any type of JOIN statement and swap it in for the SELECT used in the copy_export method.
SELECT orders.id AS order_id, orders.amount AS amount, users.id AS user_id FROM orders INNER JOIN users ON orders.user_id = users.id ORDER BY orders.id LIMIT 10
order_id,amount,user_id
1,10.00,374251
2,20.00,371735
3,5.00,58423
4,1.00,49674
5,2.00,398807
6,4.00,12144
7,4.00,21743
8,2.00,260913
9,1.00,191346
10,10.00,131893
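Plugged into the copy_export method, the heredoc from the earlier example would simply wrap this query - nothing else changes:

query = <<-SQL
  COPY (
    SELECT orders.id AS order_id, orders.amount AS amount, users.id AS user_id
    FROM orders
    INNER JOIN users ON orders.user_id = users.id
    ORDER BY orders.id
    LIMIT 10
  ) TO STDOUT WITH CSV HEADER
SQL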
Here we've smoothly gathered data from both the orders and users tables.
Renaming headers
This one is also pretty easy and is already shown in the SELECT query above. We can choose the proper headers simply by aliasing columns in the query - orders.id to order_id, orders.amount to amount and users.id to user_id.
Additional columns
There may be cases where a CSV column doesn't correspond directly to our table data. For example, it might be the sum of two other columns, a doubled value of another column, or any other kind of derived (or constant) value:
SELECT *, 2*orders.amount AS doubled_amount FROM orders
id,user_id,amount,created_at,updated_at,doubled_amount
1,374251,1000.00,2023-11-12 22:51:55.590875,2023-12-28 17:09:52.801258,2000.00
2,371735,100000.00,2023-11-12 22:51:55.67145,2023-12-28 17:10:51.519108,200000.00
3,58423,10.00,2023-11-12 22:51:55.681761,2023-11-12 22:51:55.681761,20.00
4,49674,20.00,2023-11-12 22:51:55.690487,2023-11-12 22:51:55.690487,40.00
5,398807,30.00,2023-11-12 22:51:55.696318,2023-11-12 22:51:55.696318,60.00
6,12144,10.00,2023-11-12 22:51:55.701053,2023-11-12 22:51:55.701053,20.00
7,21743,20.00,2023-11-12 22:51:55.707059,2023-11-12 22:51:55.707059,40.00
8,260913,30.00,2023-11-12 22:51:55.713588,2023-11-12 22:51:55.713588,60.00
9,191346,10.00,2023-11-12 22:51:55.720107,2023-11-12 22:51:55.720107,20.00
10,131893,10.00,2023-11-12 22:51:55.725918,2023-11-12 22:51:55.725918,20.00
Some useful resources
There's already an existing gem called postgres-copy. This gem allows your ActiveRecord models to use the PostgreSQL COPY command to import/export data in CSV format. It's similar to the solution presented in this article and is built on top of the pg gem methods. It's definitely worth checking out.
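For reference, basic usage along the lines of the gem's README looks roughly like this (acts_as_copy_target and copy_to are taken from that README, so double-check them against the version you install):

class User < ApplicationRecord
  acts_as_copy_target
end

# exports the table straight to a local CSV file, using COPY under the hood
User.copy_to('users.csv')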
Another gem related to the COPY function is activerecord-copy. However, this gem only handles copying from CSV to the table, so it doesn't fully address the topic of this article.
Final words
According to my analysis, which is included in the article, the COPY TO
method demonstrates remarkable performance advantages, as evidenced by a significant speed difference in benchmarking. However, its straightforward nature may pose challenges when dealing with complex application logic and data modifications. These challenges need to be addressed using advanced SQL queries or direct data manipulations. On the other hand, the traditional CSV library, though slower, offers a more adaptable solution for complex data manipulations and logic transformations. The choice between the two methods ultimately hinges on the specific requirements of the application, with the COPY TO
solution excelling in raw export speed and the CSV library offering versatility in handling complex data scenarios. I highly recommend giving PostgreSQL solution a try, especially if your dataset is large but not overly complicated, or you just enjoy serving some efficient SQL.