Spring Data Jpa bulk insert

1. Motivation
2. saveAllAndFlush
3. Persistable
- 3.1. Reading the reference (entity state-detection strategies)
4. batch_size
5. generate_statistics
6. Summary of properties
7. ActionQueue
8. Afterword

1 Motivation

When using Spring Data JPA, there are times you want to do a bulk insert. I assumed that simply calling saveAll() would be enough, but it was not.

Let's create an entity.

@Entity(name = "item")
data class Item(@javax.persistence.Id val id: Long, @Column(name = "name") val name: String)

interface ItemRepository : CrudRepository<Item, Long> {
}

Let's write a test.

@DataJpaTest
class ItemRepositoryTest {

    @Autowired
    private lateinit var repository: ItemRepository

    @Test
    fun test() {
        val items = listOf(Item(id = 1, name = "name1"), Item(id = 2, name = "name2"))

        repository.saveAll(items)
    }
}

The log output is as follows.

Hibernate: select item0_.id as id1_0_0_, item0_.name as name2_0_0_ from item item0_ where item0_.id=?
Hibernate: select item0_.id as id1_0_0_, item0_.name as name2_0_0_ from item item0_ where item0_.id=?

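Nothing is saved; only the selects run. The inserts stay queued in the persistence context until a flush, and running a query such as findAll triggers that flush automatically. Let's call findAll.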

@Test
fun test() {
    val items = listOf(Item(id = 1, name = "name1"), Item(id = 2, name = "name2"))

    repository.saveAll(items)

    repository.findAll()
}

--- return ---

Hibernate: select item0_.id as id1_0_0_, item0_.name as name2_0_0_ from item item0_ where item0_.id=?
Hibernate: select item0_.id as id1_0_0_, item0_.name as name2_0_0_ from item item0_ where item0_.id=?
Hibernate: insert into item (name, id) values (?, ?)
Hibernate: insert into item (name, id) values (?, ?)
Hibernate: select item0_.id as id1_0_, item0_.name as name2_0_ from item item0_

2 saveAllAndFlush

But I would like the flush to happen right away, without running a query.

In that case, use JpaRepository instead of CrudRepository.

interface ItemRepository : JpaRepository<Item, Long> {
}

Change the test as well.

@Test
fun test() {
    val items = listOf(Item(id = 1, name = "name1"), Item(id = 2, name = "name2"))

    repository.saveAllAndFlush(items)
}
--- return ---
Hibernate: select item0_.id as id1_0_0_, item0_.name as name2_0_0_ from item item0_ where item0_.id=?
Hibernate: select item0_.id as id1_0_0_, item0_.name as name2_0_0_ from item item0_ where item0_.id=?
Hibernate: insert into item (name, id) values (?, ?)
Hibernate: insert into item (name, id) values (?, ?)

3 Persistable

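Now let's remove the select statements. They come from merge(): the id is already set, so Spring Data JPA does not consider the entity new and looks it up first. We assign the ID ourselves, so JPA has nothing to generate, and since we only ever insert, there is no update case to worry about. The lookup is therefore unnecessary.

In that case, implement Persistable on the entity.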

@Entity(name = "item")
data class Item(
    @Id val id: Long,
    @Column(name = "name") val name: String
) : Persistable<Long> {
    override fun getId(): Long {
        return id
    }

    // Always report the entity as new so that save() calls persist() instead of merge().
    override fun isNew(): Boolean {
        return true
    }
}
--- return ---
Hibernate: insert into item (name, id) values (?, ?)
Hibernate: insert into item (name, id) values (?, ?)

3.1 Reading the reference (entity state-detection strategies)

Let's take a look at the Spring documentation. According to Entity State-detection Strategies, there are three strategies for detecting the state of an entity.

3.1.1 Version-Property and Id-Property inspection (default)

By default, Spring Data JPA first checks whether there is a version-property of non-primitive type. If there is, the entity is considered new when the value of that property is null. Without such a version-property, Spring Data JPA inspects the identifier property of the given entity. If the identifier property is null, the entity is considered new. Otherwise it is considered not new.

This approach is not an option for entities that

- use manually assigned identifiers,
- therefore always have a non-null identifier, and
- do not use a version-property.
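The default check can be modelled in a few lines of plain Kotlin. This is only a sketch of the rule as described above, not Spring Data JPA's actual code; EntityState and isNewByDefault are hypothetical names made up for illustration.

```kotlin
// Simplified model of the default "is this entity new?" rule.
// EntityState and isNewByDefault are hypothetical, not Spring Data JPA types.
data class EntityState(
    val hasVersionProperty: Boolean, // a version-property of non-primitive type exists
    val version: Any?,
    val id: Any?
)

fun isNewByDefault(state: EntityState): Boolean =
    if (state.hasVersionProperty) {
        state.version == null // the version-property decides when it exists
    } else {
        state.id == null      // otherwise the identifier decides
    }

fun main() {
    // Like Item(id = 1, ...) in this post: no version-property, id assigned by hand.
    val manualId = EntityState(hasVersionProperty = false, version = null, id = 1L)
    println(isNewByDefault(manualId))    // false, so save() goes through merge()

    val generatedId = EntityState(hasVersionProperty = false, version = null, id = null)
    println(isNewByDefault(generatedId)) // true, so save() goes through persist()
}
```

With a hand-assigned id the rule always answers "not new", which is why the selects appeared in the earlier logs.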

A common pattern in this scenario is to use a common base class with a transient flag that defaults to indicating a new instance, and to flip that flag in JPA lifecycle callbacks. The following is the example shown in the official documentation.

// Example 54. A base class for entities with manually assigned identifiers
@MappedSuperclass
public abstract class AbstractEntity<ID> implements Persistable<ID> {
  // Transient, so this flag is never stored.
  @Transient
  private boolean isNew = true;

  // isNew() decides whether EntityManager.persist() or merge() is used.
  // true  -> persist
  // false -> merge: looks the entity up by identifier, then inserts if it is absent or updates if it exists
  @Override
  public boolean isNew() {
    return isNew;
  }

  // JPA entity callbacks: flip the flag to mark the entity as existing
  // after a repository save(...) call or after the persistence provider loads the instance.
  @PrePersist
  @PostLoad
  void markNotNew() {
    this.isNew = false;
  }
}

3.1.2 Implementing Persistable

If an entity implements Persistable, Spring Data JPA delegates the new-entity detection to the entity's isNew() method. See the JavaDoc of the Persistable interface.

3.1.3 Implementing `EntityInformation`

You can customize the EntityInformation abstraction used by the SimpleJpaRepository implementation by creating a subclass of JpaRepositoryFactory and overriding its getEntityInformation method. You then register that custom JpaRepositoryFactory implementation as a Spring bean. (This should rarely be necessary.) See the JavaDoc.

4 `batch_size`

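Setting a batch size appears to improve performance essentially for free (link).

Now we need saveAll to actually perform a bulk insert. For that, batch_size has to be set on Hibernate. Since we are using Spring Data JPA, we set spring.jpa.properties.hibernate.jdbc.batch_size in application.properties (or the equivalent nested key in application.yml).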

spring.jpa.properties.hibernate.jdbc.batch_size=2

To check, let's write the test as below.

@Test
fun test() {
    val items = listOf(
        Item(id = 1, name = "name1"),
        Item(id = 2, name = "name2")
    )

    repository.saveAllAndFlush(items)
}
--- return ---
Hibernate: insert into item (name, id) values (?, ?)
Hibernate: insert into item (name, id) values (?, ?)

Strange. The log still shows two insert statements. It seems this SQL log does not tell us whether batching actually took place.

5 `generate_statistics`

The approach I came up with is to add spring.jpa.properties.hibernate.generate_statistics=true to the properties and check whether a batch was really executed.

spring.jpa.properties.hibernate.jdbc.batch_size=2
spring.jpa.properties.hibernate.generate_statistics=true

Let's look at the statistics output.

2023-05-11 22:43:24.190  INFO 79099 --- [    Test worker] i.StatisticalLoggingSessionEventListener : Session Metrics {
    219208 nanoseconds spent acquiring 1 JDBC connections;
    0 nanoseconds spent releasing 0 JDBC connections;
    1038416 nanoseconds spent preparing 1 JDBC statements;
    0 nanoseconds spent executing 0 JDBC statements;
    1927292 nanoseconds spent executing 1 JDBC batches;
    0 nanoseconds spent performing 0 L2C puts;
    0 nanoseconds spent performing 0 L2C hits;
    0 nanoseconds spent performing 0 L2C misses;
    12657834 nanoseconds spent executing 1 flushes (flushing a total of 2 entities and 0 collections);
    0 nanoseconds spent executing 0 partial-flushes (flushing a total of 0 entities and 0 collections)
}

The line "1927292 nanoseconds spent executing 1 JDBC batches;" above suggests that a batch was executed. Is this really different from running without batch_size? Let's remove batch_size and run it again.

2023-05-11 22:44:36.470  INFO 79134 --- [    Test worker] i.StatisticalLoggingSessionEventListener : Session Metrics {
    291208 nanoseconds spent acquiring 1 JDBC connections;
    0 nanoseconds spent releasing 0 JDBC connections;
    1292834 nanoseconds spent preparing 2 JDBC statements;
    470541 nanoseconds spent executing 2 JDBC statements;
    0 nanoseconds spent executing 0 JDBC batches;
    0 nanoseconds spent performing 0 L2C puts;
    0 nanoseconds spent performing 0 L2C hits;
    0 nanoseconds spent performing 0 L2C misses;
    12407541 nanoseconds spent executing 1 flushes (flushing a total of 2 entities and 0 collections);
    0 nanoseconds spent executing 0 partial-flushes (flushing a total of 0 entities and 0 collections)
}

This time "0 nanoseconds spent executing 0 JDBC batches;" shows that no batch was executed, and the two inserts were run as 2 separate JDBC statements.
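If you want to check this number without reading the log by eye, the batch count can be pulled out of the Session Metrics text with a regular expression. The sketch below is plain Kotlin and relies only on the log wording shown in this post; jdbcBatchCount is a hypothetical helper name, not part of Hibernate.

```kotlin
// Extracts the batch count from a "Session Metrics" block like the ones above.
// jdbcBatchCount is a hypothetical helper; it only matches the log text shown in this post.
val batchLine = Regex("""executing (\d+) JDBC batches""")

fun jdbcBatchCount(sessionMetrics: String): Int? =
    batchLine.find(sessionMetrics)?.groupValues?.get(1)?.toInt()

fun main() {
    println(jdbcBatchCount("1927292 nanoseconds spent executing 1 JDBC batches;")) // 1
    println(jdbcBatchCount("0 nanoseconds spent executing 0 JDBC batches;"))       // 0
}
```

The function returns null when the text contains no such line, so a missing metric is not mistaken for zero batches.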

6 Summary of properties

I referred to the link below.

http://devdoc.net/javaweb/hibernate/Hibernate-5.1.0/userGuide/en-US/html/ch11.html

6.1 `batch_size`

Determines the maximum number of statements Hibernate batches together before asking the driver to execute the batch. Zero or a negative number disables batching.
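As a rough illustration of what that maximum means, the following plain Kotlin sketch groups a list of statements the way a batch size would. It is a simplified model under my own assumptions, not Hibernate's implementation, and splitIntoBatches is a hypothetical helper.

```kotlin
// Rough model of how a batch size groups statements; not Hibernate's code.
// splitIntoBatches is a hypothetical helper for illustration.
fun <T> splitIntoBatches(statements: List<T>, batchSize: Int): List<List<T>> =
    if (batchSize <= 0) {
        // Batching disabled: every statement is sent on its own.
        statements.map { listOf(it) }
    } else {
        statements.chunked(batchSize)
    }

fun main() {
    val inserts = (1..5).map { "insert item $it" }
    println(splitIntoBatches(inserts, 2).map { it.size }) // [2, 2, 1]
    println(splitIntoBatches(inserts, 0).map { it.size }) // [1, 1, 1, 1, 1]
}
```

With batch_size=2 and the two items from the test, this model gives a single group, which matches the "1 JDBC batches" line in the statistics.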

6.2 `order_inserts`, `order_updates`

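These are set with spring.jpa.properties.hibernate.order_inserts and spring.jpa.properties.hibernate.order_updates.

They matter when you want to batch several kinds of entities. A JDBC batch holds executions of a single SQL statement, so when two different entities are bulk-inserted at the same time and their inserts alternate, each switch ends the current batch and batching is effectively lost.

To go a bit further, Hibernate puts EntityInsertAction and EntityUpdateAction objects into the ActionQueue. This queue has to be sorted so that actions for the same entity sit next to each other and can be batched.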
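The effect of ordering can be sketched in plain Kotlin. This is a simplified model, not Hibernate's code: countBatches is a hypothetical helper, "Order" is a made-up second entity, and sorted() merely stands in for what order_inserts does.

```kotlin
// Simplified model: a batch is sent whenever the target entity changes
// or batchSize is reached. countBatches is a hypothetical helper, not a Hibernate API.
fun countBatches(entityNames: List<String>, batchSize: Int): Int {
    require(batchSize > 0) { "batchSize must be positive in this model" }
    var batches = 0
    var current: String? = null
    var inBatch = 0
    for (name in entityNames) {
        if (name != current || inBatch == batchSize) {
            if (inBatch > 0) batches++ // send the batch collected so far
            current = name
            inBatch = 0
        }
        inBatch++
    }
    if (inBatch > 0) batches++ // send the final, possibly partial batch
    return batches
}

fun main() {
    // Inserts for two entity types arriving alternately.
    val interleaved = listOf("Item", "Order", "Item", "Order", "Item", "Order")
    println(countBatches(interleaved, 10))          // 6
    println(countBatches(interleaved.sorted(), 10)) // 2
}
```

In this model the six alternating inserts need six round trips, while the same inserts grouped by entity need only two.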

6.3 `batch_versioned_data`

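This seems to decide whether updates to versioned entities (entities with a version-property) are batched. Some JDBC drivers reportedly return incorrect row counts when a batch is executed. With such a driver, this must be set to false.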

7 ActionQueue

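ActionQueue is the class org.hibernate.engine.spi.ActionQueue. https://docs.jboss.org/hibernate/orm/4.3/javadocs/org/hibernate/engine/spi/ActionQueue.html

The description in that document is as follows.

Responsible for maintaining the queue of actions related to events. The ActionQueue holds the DML operations queued as part of a session's transactional-write-behind semantics. DML operations are queued here until a flush forces them to be executed against the database.

The methods we need to look at now are these.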

addAction(BulkOperationCleanupAction)
addAction(CollectionRecreateAction)
addAction(CollectionRemoveAction)
addAction(CollectionUpdateAction)
addAction(EntityDeleteAction)
addAction(EntityIdentityInsertAction)
addAction(EntityInsertAction)
addAction(EntityUpdateAction)
addAction(OrphanRemovalAction)
addAction(QueuedOperationCollectionAction)
sortActions()

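As the list shows, this is where the various actions are queued and wait. The sortActions() method sorts them here, which is what makes batching easier.

8 Afterword

I have come back to the Spring world after two years. I have forgotten a lot, and many features will have been added in the meantime.

From now on I will mostly post about Spring/Kotlin, and less about Clojure.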
