Generating Large Excel Files Using Spring Batch, Part Three

By Jonny Hackett | Java, Spring, Spring Batch, Technology Snapshot

While working for a client recently, I was given a small project to produce a report that would help reconcile differences in data that existed in four to five different database sources. The requirements specified a need to compare roughly 40 fields from each of these sources against each other, and to report the differences in MS Excel format, which included details regarding how the data should be displayed in the spreadsheet.

As it turned out, the challenge was not about the amount of data being processed as I originally had suspected. Instead, the challenge became how to create a potentially large Excel file without causing memory meltdown on the server hardware.

Hopefully by sharing my experience here, it might save a little time for someone else, and thus give back a little bit to the developer community.

If you haven’t read the previous posts in this series (Introducing Spring Batch and Getting Started With Spring Batch), they serve as a quick start guide and simple example for learning the basics of Spring Batch. They also serve as the starting point for this article’s example code.

The Process

For the input data to this job, we're going to read the data provided at the following URL, which generates a list of the NYSE-traded stock data in CSV format. You can also use the link to download a physical file and take a look at the data format so you know what to expect: http://www.nasdaq.com/screening/companies-by-name.aspx?letter=0&exchange=nasdaq&render=download

NOTE: If using Internet Explorer, you’ll need to visit the following URL: http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NYSE and look for the CSV download link at the top of the data results. Click the link “Download this list” and there may be a popup window in which you have to enter some text in order to get the download. This doesn’t appear to be an issue with Google Chrome or Firefox, nor is it an issue reading the download in Spring Batch as an input resource.

The reader for this step is set up almost identically to the reader example in Part Two's Getting Started With Spring Batch; it is still a CSV file, we're just specifying a URL as the resource rather than an actual physical file. The only real difference from Part Two's configuration is that we need to define a custom FieldSetMapper specific to the type of data we are mapping from the input file.

Below are the bean configurations required to set up the reader for the step to convert the incoming file from CSV into Excel format. If you downloaded the stock data file and examined its contents, you should have noticed that the first line contains header information that we should not be mapping to a data object. This header information is skipped by adding the “linesToSkip” property to the FlatFileItemReader bean definition as you see below:

<bean name="stockDataReader"
	class="org.springframework.batch.item.file.FlatFileItemReader">
	<property name="resource"
		value="http://www.nasdaq.com/screening/companies-by-name.aspx?letter=0&amp;exchange=nasdaq&amp;render=download" />
	<property name="lineMapper" ref="stockDataLineMapper" />
	<property name="linesToSkip" value="1" />
</bean>

<bean name="stockDataLineMapper"
	class="org.springframework.batch.item.file.mapping.DefaultLineMapper">
	<property name="fieldSetMapper" ref="stockDataFieldMapper" />
	<property name="lineTokenizer" ref="stockDataLineTokenizer" />
</bean>

<bean name="stockDataLineTokenizer"
	class="org.springframework.batch.item.file.transform.DelimitedLineTokenizer" />

Next, we need to create the data object that each incoming file record will be mapped to. For this particular file, it will look like this:

package com.keyhole.example.poi;

import java.io.Serializable;
import java.math.BigDecimal;

public class StockData implements Serializable {

	private static final long serialVersionUID = 4383231542218565966L;
	private String symbol;
	private String name;
	private BigDecimal lastSale;
	private BigDecimal marketCap;
	private String adrTso;
	private String ipoYear;
	private String sector;
	private String industry;
	private String summaryUrl;

	// getters and setters removed for brevity

}

Now that we have defined the data object that our file will be mapped to, we need to create the custom FieldSetMapper implementation. It should look like this:

package com.keyhole.example.poi;

import java.math.BigDecimal;

import org.springframework.batch.item.file.mapping.FieldSetMapper;
import org.springframework.batch.item.file.transform.FieldSet;
import org.springframework.stereotype.Component;
import org.springframework.validation.BindException;

@Component("stockDataFieldMapper")
public class StockDataFieldSetMapper implements FieldSetMapper<StockData> {

	public StockData mapFieldSet(FieldSet fieldSet) throws BindException {
		StockData data = new StockData();
		data.setSymbol(fieldSet.readString(0));
		data.setName(fieldSet.readString(1));

		String lastSaleVal = fieldSet.readString(2);
		if ("n/a".equals(lastSaleVal)) {
			data.setLastSale(BigDecimal.ZERO);
		} else {
			data.setLastSale(new BigDecimal(lastSaleVal));
		}

		data.setMarketCap(fieldSet.readBigDecimal(3));
		data.setAdrTso(fieldSet.readString(4));
		data.setIpoYear(fieldSet.readString(5));
		data.setSector(fieldSet.readString(6));
		data.setIndustry(fieldSet.readString(7));
		data.setSummaryUrl(fieldSet.readString(8));
		return data;
	}
}
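That "n/a" guard for the Last Sale column is easy to check in isolation. Here's a small, self-contained sketch of the same parsing logic (the LastSaleParser class and its extra null/empty handling are illustrative additions, not part of the job's actual code):

```java
import java.math.BigDecimal;

// Illustrative stand-in for the "n/a" guard in StockDataFieldSetMapper.
// The null/empty checks are extra defensive assumptions, not in the original mapper.
public class LastSaleParser {

	public static BigDecimal parseLastSale(String raw) {
		if (raw == null || raw.isEmpty() || "n/a".equals(raw)) {
			return BigDecimal.ZERO;
		}
		return new BigDecimal(raw);
	}

	public static void main(String[] args) {
		System.out.println(parseLastSale("n/a"));   // prints 0
		System.out.println(parseLastSale("12.34")); // prints 12.34
	}
}
```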

After defining the configuration and implementation for reading the stock data file, we're ready to move on to implementing our Excel ItemWriter. The two most commonly used open source Java APIs for this are Apache POI and JExcelAPI. As most people might attest, generating large files with them typically results in a high memory footprint, as they require building the entire Excel workbook in memory prior to writing out the file.

However, beginning with Apache POI version 3.8-beta3 in June of 2011, developers now have the option to use a low-memory footprint Excel API. Apache POI also has additional advantages in that it is continually evolving and has a strong development community, ensuring that it will be maintained for the foreseeable future.

  • If you are using Maven and an Eclipse-based IDE like SpringSource Tool Suite (STS), it's very simple to add Apache POI to your project. Setting up a Spring Batch project in STS was detailed in Getting Started With Spring Batch, so we won't go into project setup here. Right-click the project in STS, select "Maven," and then select "Add Dependency." In the dialog's search entry, enter POI and look for the result that corresponds to the org.apache.poi group. After that, repeat the same process for poi-ooxml, likewise selecting the result that corresponds to the org.apache.poi group.
  • If you are not using Maven, you’ll need to visit the Apache POI website to manually download the latest version and move the required jars into your lib directory manually. This will include the poi jar, poi-ooxml jar and all of its associated jars. Details of each can be found on the Apache website.
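For reference, the resulting pom.xml entries look something like the following (the version number is an assumption; use whichever current release the dependency search returns, as long as it is 3.8-beta3 or newer):

```xml
<!-- Version is illustrative; any release >= 3.8-beta3 provides SXSSF -->
<dependency>
	<groupId>org.apache.poi</groupId>
	<artifactId>poi</artifactId>
	<version>3.9</version>
</dependency>
<dependency>
	<groupId>org.apache.poi</groupId>
	<artifactId>poi-ooxml</artifactId>
	<version>3.9</version>
</dependency>
```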

This new Excel API from Apache is named SXSSF. It is an API-compatible streaming extension of XSSF, the API used to create the newer Excel 2007-based OOXML (.xlsx) files. SXSSF keeps memory usage low by limiting the number of rows held in memory to a sliding window. For example, if you define the sliding window as 50, then when the 51st row is created, the first row in the window is written to disk. This repeats as each new row is created: the oldest row in the window is flushed to disk and becomes inaccessible, since it is no longer in memory.
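To make the window mechanics concrete, here is a toy, self-contained model of the flush-the-oldest-row behavior (purely illustrative; the real SXSSF streams evicted rows to a temporary file rather than collecting them in a list):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy model of SXSSF's sliding window: once the window is full, creating a
// new row evicts the oldest in-memory row (SXSSF writes it to a temp file;
// here we just collect it in a list to show the mechanics).
public class SlidingWindowModel {

	private final int windowSize;
	private final Deque<Integer> inMemory = new ArrayDeque<Integer>();
	private final List<Integer> flushed = new ArrayList<Integer>();

	public SlidingWindowModel(int windowSize) {
		this.windowSize = windowSize;
	}

	public void createRow(int rowNum) {
		if (inMemory.size() == windowSize) {
			flushed.add(inMemory.removeFirst()); // oldest row leaves memory
		}
		inMemory.addLast(rowNum);
	}

	public int rowsInMemory() { return inMemory.size(); }
	public int rowsFlushed()  { return flushed.size(); }

	public static void main(String[] args) {
		SlidingWindowModel window = new SlidingWindowModel(50);
		for (int i = 1; i <= 51; i++) {
			window.createRow(i);
		}
		// Creating row 51 pushed row 1 out of memory.
		System.out.println(window.rowsInMemory() + " in memory, "
				+ window.rowsFlushed() + " flushed");
	}
}
```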

To begin using this streaming version SXSSF, it is really just as simple as this:

	Workbook wb = new SXSSFWorkbook(100);

By instantiating an SXSSFWorkbook object and calling the constructor that accepts an integer parameter, we have defined our workbook for streaming with a sliding window of 100 rows.

Since the goal of this example is to reduce the memory footprint, we’re going to be processing the file in chunks of 500. In order to process the file in chunks like this, we’ll need to create our Excel workbook once at the beginning of the step and close the output stream at the very end of the step, while writing the data out in between. To do this, we’re going to create our ItemWriter with methods to be processed before the step and after the step, and implemented using Spring Batch’s built-in annotations.

First, here’s the method that will be created to handle the BeforeStep interception. I have left out the details of how the title and header information were created, but they will be included in the complete code listing near the end.

@BeforeStep
public void beforeStep(StepExecution stepExecution) {

	String dateTime = DateFormatUtils.format(Calendar.getInstance(),
			"yyyyMMdd_HHmmss");
	outputFilename = FILE_NAME + "_" + dateTime + ".xlsx";

	workbook = new SXSSFWorkbook(100);
	Sheet sheet = workbook.createSheet("Testing");
	sheet.createFreezePane(0, 3, 0, 3);
	sheet.setDefaultColumnWidth(20);

	addTitleToSheet(sheet);
	currRow++;
	addHeaders(sheet);
	initDataStyle();
}

The method can be named anything, but by convention I normally name it beforeStep to stay consistent with the purpose and use of the method. Annotating the method with @BeforeStep tells Spring Batch that before the step is executed, this method should be called with the StepExecution passed in as a parameter. Typically these methods are used to configure resources that will be used by the bean, whether that bean is a reader, processor, writer, or tasklet.

It's also important to note that a bean (including any classes it extends) can have only one @BeforeStep and one @AfterStep annotated method. The code listing here defines the output file name and instantiates the workbook with a sliding window of 100 rows (which happens to be the default, but is listed here to show how it would be set). We also create the first sheet, add the title and header information to it, and initialize the cell style that will be used for the data output.
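As a side note, the timestamped filename built in beforeStep can also be produced with the JDK alone, without the commons-lang DateFormatUtils dependency. A quick sketch (the base path is just a placeholder):

```java
import java.text.SimpleDateFormat;
import java.util.Date;

// JDK-only equivalent of the DateFormatUtils call used in beforeStep.
// The base path is a placeholder, not the job's actual configuration.
public class OutputFilenameExample {

	static String buildFilename(String basePath, Date now) {
		String dateTime = new SimpleDateFormat("yyyyMMdd_HHmmss").format(now);
		return basePath + "_" + dateTime + ".xlsx";
	}

	public static void main(String[] args) {
		System.out.println(buildFilename("/data/example/excel/StockData", new Date()));
	}
}
```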

Here is the method on the writer that will be called in the AfterStep phase of job execution:

@AfterStep
public void afterStep(StepExecution stepExecution) throws IOException {
	FileOutputStream fos = new FileOutputStream(outputFilename);
	workbook.write(fos);
	fos.close();
}

Just as with the @BeforeStep annotated method, the @AfterStep annotated method is typically used to wrap up necessary items once a step has completed; as their names imply, these methods are called before the step begins executing and after it has finished. The code listed here creates the output stream for the Excel workbook to write to and, once the write has completed, closes that stream. What's important to note is that the workbook.write(fos) call at this point takes the temp files that were used to stream the Excel data out to disk and assembles them back into a single Excel .xlsx file. (SXSSFWorkbook also provides a dispose() method that deletes those temp files once you are finished with the workbook.)

Now that we've covered setting up the Excel workbook and closing it out, it's time to take care of the method that actually takes the data read from the input source and converts it into the rows and cells that make up the detail data of the Excel file.

Here's the code listing of the write method:

@Override
public void write(List<? extends StockData> items) throws Exception {

	Sheet sheet = workbook.getSheetAt(0);

	for (StockData data : items) {
		for (int i = 0; i < 300; i++) {
			currRow++;
			Row row = sheet.createRow(currRow);
			createStringCell(row, data.getSymbol(), 0);
			createStringCell(row, data.getName(), 1);
			createNumericCell(row, data.getLastSale().doubleValue(), 2);
			createNumericCell(row, data.getMarketCap().doubleValue(), 3);
			createStringCell(row, data.getAdrTso(), 4);
			createStringCell(row, data.getIpoYear(), 5);
			createStringCell(row, data.getSector(), 6);
			createStringCell(row, data.getIndustry(), 7);
			createStringCell(row, data.getSummaryUrl(), 8);
		}
	}
}

In this write method, the code is pretty straightforward. As we loop through the list of StockData objects that were mapped from our input file, we create a new row and its cells for each item of data. Since the input file is only a little more than a couple of thousand rows, it wouldn't on its own be a good test of generating a large Excel file. That's why you see the additional loop that creates 300 rows for each item we write. By the time the job finishes, we will have generated an Excel file with a little over 800,000 rows, just to prove we can do it, not that you should.
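The arithmetic behind that row count is simple; here's a sketch (the input size of roughly 2,700 records is an assumption about the NASDAQ list at the time of writing, and the three extra rows come from the title, spacer, and header rows created in beforeStep):

```java
// Back-of-the-envelope arithmetic for the final row count. The input size
// of roughly 2,700 records is an assumption, not an exact figure.
public class RowCountEstimate {

	static int totalRows(int inputRecords, int copiesPerRecord, int headerRows) {
		return inputRecords * copiesPerRecord + headerRows;
	}

	public static void main(String[] args) {
		System.out.println(totalRows(2700, 300, 3)); // prints 810003
	}
}
```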

The two methods below are convenience methods for the actual creation of each individual cell within the row to simplify some repeated code:

	private void createStringCell(Row row, String val, int col) {
		Cell cell = row.createCell(col);
		cell.setCellType(Cell.CELL_TYPE_STRING);
		cell.setCellValue(val);
	}

	private void createNumericCell(Row row, Double val, int col) {
		Cell cell = row.createCell(col);
		cell.setCellType(Cell.CELL_TYPE_NUMERIC);
		cell.setCellValue(val);
	}

Putting it all together, here is the complete code listing for the StockDataExcelWriter. One important note regarding this class is its use of Spring's @Scope("step") annotation. By default, Spring beans are created as singletons when they are loaded into the Spring context. Since this writer holds state (the current row being written, the workbook object, and a reusable cell style), we need the framework to instantiate a fresh StockDataExcelWriter for each step execution. Otherwise, we could run into thread-safety issues if this job were ever run concurrently.

package com.keyhole.example.poi;

import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Calendar;
import java.util.List;

import org.apache.commons.lang3.time.DateFormatUtils;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.CellStyle;
import org.apache.poi.ss.usermodel.Font;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.util.CellRangeAddress;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.annotation.AfterStep;
import org.springframework.batch.core.annotation.BeforeStep;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Scope;
import org.springframework.stereotype.Component;

@Component("stockDataExcelWriter")
@Scope("step")
public class StockDataExcelWriter implements ItemWriter<StockData> {

	private static final String FILE_NAME = "/data/example/excel/StockData";
	private static final String[] HEADERS = { "Symbol", "Name", "Last Sale",
			"Market Cap", "ADR TSO", "IPO Year", "Sector", "Industry",
			"Summary URL" };

	private String outputFilename;
	private Workbook workbook;
	private CellStyle dataCellStyle;
	private int currRow = 0;

	private void addHeaders(Sheet sheet) {

		Workbook wb = sheet.getWorkbook();

		CellStyle style = wb.createCellStyle();
		Font font = wb.createFont();

		font.setFontHeightInPoints((short) 10);
		font.setFontName("Arial");
		font.setBoldweight(Font.BOLDWEIGHT_BOLD);
		style.setAlignment(CellStyle.ALIGN_CENTER);
		style.setFont(font);

		Row row = sheet.createRow(2);
		int col = 0;

		for (String header : HEADERS) {
			Cell cell = row.createCell(col);
			cell.setCellValue(header);
			cell.setCellStyle(style);
			col++;
		}
		currRow++;
	}

	private void addTitleToSheet(Sheet sheet) {

		Workbook wb = sheet.getWorkbook();

		CellStyle style = wb.createCellStyle();
		Font font = wb.createFont();

		font.setFontHeightInPoints((short) 14);
		font.setFontName("Arial");
		font.setBoldweight(Font.BOLDWEIGHT_BOLD);
		style.setAlignment(CellStyle.ALIGN_CENTER);
		style.setFont(font);

		Row row = sheet.createRow(currRow);
		row.setHeightInPoints(16);

		String currDate = DateFormatUtils.format(Calendar.getInstance(),
				DateFormatUtils.ISO_DATETIME_FORMAT.getPattern());

		Cell cell = row.createCell(0, Cell.CELL_TYPE_STRING);
		cell.setCellValue("Stock Data as of " + currDate);
		cell.setCellStyle(style);

		CellRangeAddress range = new CellRangeAddress(0, 0, 0, 7);
		sheet.addMergedRegion(range);
		currRow++;

	}

	@AfterStep
	public void afterStep(StepExecution stepExecution) throws IOException {
		FileOutputStream fos = new FileOutputStream(outputFilename);
		workbook.write(fos);
		fos.close();
	}

	@BeforeStep
	public void beforeStep(StepExecution stepExecution) {
		System.out.println("Calling beforeStep");

		String dateTime = DateFormatUtils.format(Calendar.getInstance(),
				"yyyyMMdd_HHmmss");
		outputFilename = FILE_NAME + "_" + dateTime + ".xlsx";

		workbook = new SXSSFWorkbook(100);
		Sheet sheet = workbook.createSheet("Testing");
		sheet.createFreezePane(0, 3, 0, 3);
		sheet.setDefaultColumnWidth(20);

		addTitleToSheet(sheet);
		currRow++;
		addHeaders(sheet);
		initDataStyle();

	}

	private void initDataStyle() {
		dataCellStyle = workbook.createCellStyle();
		Font font = workbook.createFont();

		font.setFontHeightInPoints((short) 10);
		font.setFontName("Arial");
		dataCellStyle.setAlignment(CellStyle.ALIGN_LEFT);
		dataCellStyle.setFont(font);
	}

	@Override
	public void write(List<? extends StockData> items) throws Exception {

		Sheet sheet = workbook.getSheetAt(0);

		for (StockData data : items) {
			for (int i = 0; i < 300; i++) {
				currRow++;
				Row row = sheet.createRow(currRow);
				createStringCell(row, data.getSymbol(), 0);
				createStringCell(row, data.getName(), 1);
				createNumericCell(row, data.getLastSale().doubleValue(), 2);
				createNumericCell(row, data.getMarketCap().doubleValue(), 3);
				createStringCell(row, data.getAdrTso(), 4);
				createStringCell(row, data.getIpoYear(), 5);
				createStringCell(row, data.getSector(), 6);
				createStringCell(row, data.getIndustry(), 7);
				createStringCell(row, data.getSummaryUrl(), 8);
			}
		}
	}

	private void createStringCell(Row row, String val, int col) {
		Cell cell = row.createCell(col);
		cell.setCellType(Cell.CELL_TYPE_STRING);
		cell.setCellValue(val);
	}

	private void createNumericCell(Row row, Double val, int col) {
		Cell cell = row.createCell(col);
		cell.setCellType(Cell.CELL_TYPE_NUMERIC);
		cell.setCellValue(val);
	}

}

Here’s the Spring Batch configuration for the job:

<batch:job id="PoiExcelConverter">
	<batch:step id="convertDataToExcel">
		<batch:tasklet transaction-manager="transactionManager">
			<batch:chunk reader="stockDataReader"
				writer="stockDataExcelWriter"
				commit-interval="500" />
		</batch:tasklet>
	</batch:step>
</batch:job>

Now that we have proven we can create huge Excel files, note that there are a few limitations to this approach, as listed on the Apache POI website:

  • Only a limited number of rows are accessible at any point in time: the rows still in the window that haven't been flushed to disk.
  • Sheet.clone() is not supported.
  • Formula evaluation is not supported.

Troubleshooting

There is one issue that I came across that took me a little while to resolve. If you use the OOXML formats with POI you might come across this error:

“Excel found unreadable content in ‘PoiTest.xlsx’. Do you want to recover the contents of this workbook? If you trust the source of this workbook, click Yes.”

And upon clicking "Yes," it is followed by an error dialog that includes a link to a repair log.

This usually means that you have made a mistake defining a style somewhere in your code. By following the log link at the bottom of that dialog, there's a good chance you'll be pointed in the right direction of where the issue is. Hopefully this little nugget of information will save you some time researching the error.

Conclusion

So now that we’re done, we have shown that there is a viable way of generating extremely large Excel workbooks without bringing the server to its knees.

The real question now becomes: do you really need this in Excel format? Does this 800,000+ row workbook provide any real value to the business? Just because you can, doesn’t always mean that you should. But sometimes it’s just fun to find out if you can.

— Jonny Hackett, asktheteam@keyholesoftware.com

Spring Batch Blog Series

Part One: Introducing Spring Batch

Part Two:  Getting Started With Spring Batch

Part Three: Generating Large Excel Files Using Spring Batch

Scaling Spring Batch – Step Partitioning

Spring Batch Unit Testing and Mockito

Spring Batch – Replacing XML Job Configuration With JavaConfig

References

Apache POI (Excel): http://poi.apache.org/spreadsheet/index.html

JExcel API: http://jexcelapi.sourceforge.net

Spring Batch: http://static.springsource.org/spring-batch/


About the Author
Jonny Hackett


Jonny is a Senior Software Engineer and Mentor with 15+ years of experience in IT. A Java developer, avid Sporting KC fan, and photographer (check him out at www.Facebook.com/no9photography), Jonny is also our resident Spring Batch expert.



Comments 9

  1. Pingback: Introducing Spring Batch, Part One « Keyhole Software

  2. Pingback: Getting Started With Spring Batch, Part Two « Keyhole Software

  3. Nice Article using Spring Batch with POI, but I am looking for POI generating .xls files , We have requirement of generating .xls file in our web app with huge list of Trades , like 60,000 to 1,00,000 . How will Spring Batch address that with using HSSFWorkBook instead of using SXSSFWorkbook ?

    1. Hi Rahul, thanks for the question. You can still use Spring Batch and the HSSF model together, but unfortunately Spring Batch won’t address the memory issues associated with large .xls files. Because the HSSF model is based upon the older binary Excel version, it still requires you to create the entire workbook object in memory prior to writing the file out. By using the SXSSF model you don’t have to keep the entire Excel workbook in memory and instead it will periodically write out portions of the file keeping the memory low. There’s a small chart on the bottom of the POI Spreadsheet page that lists the different models and their features ( http://poi.apache.org/spreadsheet/index.html ). Based upon that chart, the SXSSF model is the only one that supports buffered streaming when writing files.

  4. Pingback: Scaling Spring Batch – Step Partitioning | Keyhole Software

  5. There seems to be problem with the 2 features, i.e., sliding window of excel(SXSSFWorkbook) and commit interval of spring batch.

    I tried writing 50K records. For every 10K I am creating a new sheet. So my expected output should be 10K in each.

    But with SXSSFWorkbook(500) and commit-interval=500, the data seems to break in wrong way in the 2nd to last sheets.

    Although the data is written correctly when its going into one sheet.

    [Note: I am doing this workaround just because of the 1048575 limit of SXSSFWorkbook in excel.]

    Sid.

    1. Thanks Sid, I’ll check this out a little later in the week and see if I can replicate the issue. Are you using this exact code plus a few modifications to write the new worksheets? Or are there quite a few differences?

      -jonny

  6. Thanks Jonny for getting back. There is change from my last update.

    Just a minor change in code near the creating row logic. Here is code I have changed. Besides I am not using the inner loop of 300, as my reader query already fetches around 4252362 records. Hope you can find the reason I am lossing data in the subsequent sheets. Is it bcos of some clash in sliding window of SXSSFWorkbook and commit-interval in spring batch.

    Row row=null;
    try {
    row = sheet.createRow(currRow);
    } catch (IllegalArgumentException iae) {
    String strMessage=”Invalid row number (1048576) outside allowable range (0..1048575)”;
    if(strMessage.equals(iae.getMessage())){
    System.out.println(“Exceeded limit”);
    currRow=0;
    sheet = workbook.createSheet();
    row = sheet.createRow(currRow);
    }
    }

    To look at the above problem at a smaller scope. I have a query which returns only 48K records. Then I change the above logic to code below. The output(xlsx) I get is 10K records in first sheet but only 3 rows in subsequent 4 sheets.

    if(currRow % 10000==0){
    System.out.println(“1 million crossed”);
    currRow=0;
    sheet = workbook.createSheet();
    }

    In both the scenarios: commit-interval=500 and slide window for SXSSFWorkbook(100).

  7. Hi Jonny Hackett,

    It is really an awesome blog and thanks for sharing your experience.
    Actually we are trying to have a custom item reader for excel sheets to load the data from excel to db. But I guess we are missing something some where while customizing. Could you please help us regarding this and provide us some sample program on this. It would be of great help.

    Thanks.
