Storage Issues

Lmdb and lmdbjava

Lmdb is written in C and located at https://symas.com/lmdb/.

Lmdbjava is a Java wrapper for lmdb, which uses JNI and is located at https://github.com/lmdbjava/lmdbjava.

The performance of lmdbjava relative to lmdb is consistent with the performance of a Java program relative to the equivalent C program.

Lmdbjava does not enhance the functionality of lmdb, so in this paper, when I discuss the capabilities of lmdbjava I am also referring to the capabilities of lmdb.


Requirement: each key can be associated with multiple large values

Greg, Mike, and I discussed the storage layer requirements in a call on September 11, 2017.  Greg mentioned that each key may be associated with multiple blobs.  Mike asked Greg how big the blobs would be: “Kilobytes?  Megabytes?  Gigabytes?”  Greg answered, “Megabytes.”  Neither Mike nor I asked Greg whether this requirement is a necessity.

This requirement cannot be satisfied by the API exposed by lmdbjava.  The problem is that when lmdbjava is configured to associate multiple values with a key, the size of each of those values is constrained to 511 bytes.  In this section, I explain this problem in more detail.

When you open a database with lmdbjava in Scala, you specify that a key can be associated with multiple values by passing the MDB_DUPSORT flag.

    val db: Dbi[ByteBuffer] = env.openDbi(nameIn, MDB_CREATE, MDB_DUPSORT)

When MDB_DUPSORT is present, the size of each value is constrained to 511 bytes.  If you’d like to see the code that imposes that constraint, I provide it at the end of this document.
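
To make the limit concrete, here is a minimal sketch, not taken from our code, of what happens when a value larger than 511 bytes is written to a database opened with MDB_DUPSORT.  The db variable is assumed to be the Dbi opened as shown above, and the key and value buffers are illustrative; the write is expected to be rejected by lmdb with MDB_BAD_VALSIZE, which lmdbjava surfaces as an LmdbException.

import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets.UTF_8
import org.lmdbjava.LmdbException

// db is assumed to be the Dbi[ByteBuffer] opened above with MDB_CREATE and MDB_DUPSORT
val keyBytes = "someKey".getBytes(UTF_8)
val key = ByteBuffer.allocateDirect(keyBytes.length)
key.put(keyBytes).flip()

// 1024 bytes, well above the 511-byte limit on MDB_DUPSORT values
val bigValue = ByteBuffer.allocateDirect(1024)
bigValue.put(new Array[Byte](1024)).flip()

try {
  db.put(key, bigValue)   // put() without an explicit Txn runs in its own write transaction
} catch {
  case e: LmdbException =>
    // lmdb rejects the oversized value with MDB_BAD_VALSIZE
    println(s"put rejected: ${e.getMessage}")
}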

Since clients of the storage layer want to associate multiple large values with a key, we need to work around this limitation in lmdbjava.

My solution is to use two lmdbjava databases, one configured with MDB_DUPSORT and one that is not.  The one that is not configured with MDB_DUPSORT can associate a key with a single large value.  Call the MDB_DUPSORT database the keyDb and the non-MDB_DUPSORT database the valueKeyDb.

When a client of the storage layer adds a key and a value, we generate a small intermediary value that functions as a key to the larger value in the valueKeyDb.  Call this intermediary value a “value-key.”

For example, suppose we have a key k with which we want to associate two large values named bigValue1 and bigValue2.  When the client associates k with bigValue1, we generate a value-key vk1 and associate k with vk1 in the keyDb.  In the valueKeyDb, we associate vk1 with bigValue1.  Similarly, when the client associates k with bigValue2, we generate a value-key vk2 and associate k with vk2 in the keyDb and associate vk2 with bigValue2 in the valueKeyDb.

When the client wants to retrieve the values associated with k, we retrieve vk1 and vk2 from keyDb.  Then we look up the values associated with vk1 and vk2 in valueKeyDb and return those values to the client.
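
The sketch below shows the shape of this scheme.  It is illustrative rather than our actual implementation: the TwoDbStorage class name, the bb and str string conversions, and the makeValueKey placeholder (a stand-in for the value-key generation described next) are my own assumptions layered on top of the lmdbjava API.

import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets.UTF_8
import org.lmdbjava.{Dbi, Env, GetOp, SeekOp}
import org.lmdbjava.DbiFlags.{MDB_CREATE, MDB_DUPSORT}

// env must have been created with maxDbs of at least 2
class TwoDbStorage(env: Env[ByteBuffer]) {
  // keyDb: key -> value-key; MDB_DUPSORT allows many small value-keys per key
  private val keyDb: Dbi[ByteBuffer] = env.openDbi("keyDb", MDB_CREATE, MDB_DUPSORT)
  // valueKeyDb: value-key -> large value; no MDB_DUPSORT, so no 511-byte cap on values
  private val valueKeyDb: Dbi[ByteBuffer] = env.openDbi("valueKeyDb", MDB_CREATE)

  private def bb(s: String): ByteBuffer = {
    val bytes = s.getBytes(UTF_8)
    val buf = ByteBuffer.allocateDirect(bytes.length)
    buf.put(bytes).flip()
    buf
  }

  private def str(b: ByteBuffer): String = UTF_8.decode(b).toString

  // Placeholder for the value-key generation scheme described below.
  private def makeValueKey(key: String): String =
    key + f"_${System.currentTimeMillis()}%011X_${scala.util.Random.nextInt(Int.MaxValue)}%08X"

  // Associate key with one more large value.
  def put(key: String, value: String): Unit = {
    val valueKey = makeValueKey(key)                // e.g. key + "_15F22F3D18C_3F72CF5C"
    val txn = env.txnWrite()
    try {
      keyDb.put(txn, bb(key), bb(valueKey))         // small duplicate value, under 511 bytes
      valueKeyDb.put(txn, bb(valueKey), bb(value))  // large value, no MDB_DUPSORT limit
      txn.commit()
    } finally txn.close()
  }

  // Retrieve every large value associated with key.
  def get(key: String): List[String] = {
    val txn = env.txnRead()
    try {
      val results = scala.collection.mutable.ListBuffer[String]()
      val cursor = keyDb.openCursor(txn)
      try {
        // position on the first duplicate for key, then walk the remaining duplicates
        if (cursor.get(bb(key), GetOp.MDB_SET_KEY)) {
          do {
            val valueKey = str(cursor.`val`())
            val bigValue = valueKeyDb.get(txn, bb(valueKey))
            if (bigValue != null) results += str(bigValue)
          } while (cursor.seek(SeekOp.MDB_NEXT_DUP))
        }
      } finally cursor.close()
      results.toList
    } finally txn.close()
  }
}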

The main problem to be solved is how to generate unique value-keys.  Two simple approaches are insufficient.  The first approach is to use a timestamp alone as the value-key.  This won’t work because the clock is only accurate to milliseconds and many database insertions can happen within a single millisecond.  The second approach is to generate a random number alone as the value-key.  This approach is insufficient because in my testing, multiple runs that generated 100,000 random values produced duplicates.

My solution is to combine both approaches in the generation of the value-key. Specifically, given a key, the value-key is generated with these steps:

  1. Create a string representation of the key.
  2. Create a string of the form _XXXXXXXXXXX_XXXXXXXX, where the leftmost group of 11 Xs is the time in milliseconds as hexadecimal digits and the rightmost group of 8 Xs is a random number as hexadecimal digits.
  3. Append the string from step 2 to the string from step 1.

Thus the generated value-key has the form <key as string>_XXXXXXXXXXX_XXXXXXXX.

Admittedly, my solution does not guarantee uniqueness, but I believe it is as close as we can get and minimizes the chance of collision.

One consequence of this approach is that the maximum size of a key is 490 bytes, because the last 21 bytes of the 511-byte limit are devoted to the _XXXXXXXXXXX_XXXXXXXX postfix appended to the end of the value-key.
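
Here is a short sketch of how the pieces fit together, using the nextId(), randGen, valuesInUse, and postfixLength definitions from the code below.  The maxKeyBytes constant and this version of makeValueKey are illustrative names rather than existing code; they complete the makeValueKey placeholder used in the earlier sketch.

// Illustrative: assemble the full value-key and enforce the 490-byte key limit.
// nextId, randGen, valuesInUse, and postfixLength are defined in the code below.
val maxKeyBytes = 511 - postfixLength   // 511 - 21 = 490

def makeValueKey(key: String): String = {
  require(key.getBytes("UTF-8").length <= maxKeyBytes,
    s"key may be at most $maxKeyBytes bytes once the value-key postfix is appended")
  // e.g. "myKey" -> "myKey_15F22F3D18C_3F72CF5C"
  key + nextId(randGen, valuesInUse)
}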


Here is the relevant code.

import java.util.Calendar

// random number generator, seeded with the current time
val randGen = new scala.util.Random(Calendar.getInstance().getTimeInMillis())

// constants used to generate the value-key postfix
// Int.MaxValue -> 7FFFFFFF
val randGenMax = Int.MaxValue
val randGenMaxDigits = 8
val millisMaxDigits = 11
val randGenFormat = "%0" + randGenMaxDigits + "X"
// value-key postfix example: _15F22F3D18C_3F72CF5C
val postfixLength = randGenMaxDigits + millisMaxDigits + 2

// When a new value-key postfix is needed, call this function.
// Example return value: _15F22F3D18C_3F72CF5C
// valuesInUse holds every postfix handed out so far; if a candidate collides
// with one already in use, the loop simply generates another candidate.
def nextId(randGen: scala.util.Random,
        valuesInUse: scala.collection.mutable.SortedSet[String])
: String = {
      var next = ""
      do {
            val now = Calendar.getInstance().getTimeInMillis()
            val nowHexString = randGenFormat.format(now)
            assert(nowHexString.length <= millisMaxDigits)

            val rand = randGen.nextInt(randGenMax)
            val randHexString = randGenFormat.format(rand)
            assert(randHexString.length <= randGenMaxDigits)
            next = "_" + nowHexString + "_" + randHexString
      } while (valuesInUse.contains(next))
      valuesInUse.add(next)
      next
}

// Does the string end with a value-key postfix?
// "Dup" is meant to suggest MDB_DUPSORT
def isDupValue(valueIn: String): Boolean = {
      // value postfix example, _15F22F3D18C_3F72CF5C
      //                       2109876543210987654321
      if (valueIn.length <= postfixLength)
            return false

      // the two underscores must sit 21 and 9 characters from the end
      if (valueIn(valueIn.length - 9) != '_' || valueIn(valueIn.length - 21) != '_')
            return false
      // 11 hexadecimal digits for the time in milliseconds
      if (!isHexString(valueIn.substring(valueIn.length - 20, valueIn.length - 9)))
            return false
      // 8 hexadecimal digits for the random number
      if (!isHexString(valueIn.substring(valueIn.length - 8)))
            return false
      true
}

// https://stackoverflow.com/questions/19704787/scala-check-if-string-contains-no-special-chars
def isHexString(s:String): Boolean = {
      val ordinary=(('a' to 'f') ++ ('A' to 'F') ++ ('0' to '9')).toSet
      s.forall(ordinary.contains(_))
}


// Test functions

def dupValuesAppearsSafe(): Unit = {
      println("dupValuesAppearsSafe() begin")
      val randGen =
            new scala.util.Random(Calendar.getInstance().getTimeInMillis())
      val valuesInUse =
            scala.collection.mutable.SortedSet[String]()
      val safe = dupValuesAppearsSafe(randGen, valuesInUse)
      println(s"dupValuesAppearsSafe() safe: $safe")
      println("dupValuesAppearsSafe() end\n")
}


// Generate numMillis value-key postfixes and verify that no duplicates occur.
def dupValuesAppearsSafe(randGen: scala.util.Random,
        valuesInUse: scala.collection.mutable.SortedSet[String],
        numMillis: Int = 100000)
: Boolean = {
      val millis = new Array[String](numMillis)
      for (i <- 0 until numMillis) {
            val m = Storage.nextId(randGen, valuesInUse)
            // println(m)
            for (j <- 0 until i) {
                  if (m == millis(j)) {
                        println("dupValuesAppearsSafe(): duplicate IDs:")
                        println(s"$i: $m, $j: ${millis(j)}")
                        assert(false)
                        return false
                  }
            }
            millis(i) = m
      }
      true
}

def testDupValues(): Unit = {
      println("testDupValues() begin")

      assert(!Storage.isDupValue(""))
      assert(!Storage.isDupValue("x"))
      assert(!Storage.isDupValue("_15F22F3D18C_3F72CF5C"))
      assert( Storage.isDupValue("x_15F22F3D18C_3F72CF5C"))
      assert( Storage.isDupValue("x_15F22F3D18C_3F72CF5C_15F22F3D18C_3F72CF5C"))
      assert(!Storage.isDupValue("x_015F22F3D18C_3F72CF5C"))
      assert(!Storage.isDupValue("x_5F22F3D18C_3F72CF5C"))
      assert(!Storage.isDupValue("x_15F22F3D18C_03F72CF5C"))
      assert(!Storage.isDupValue("x_15F22F3D18C_F72CF5C"))
      assert(!Storage.isDupValue("x_G5F22F3D18C_3F72CF5C"))
      assert(!Storage.isDupValue("x_15F22F3D18C_GF72CF5C"))

      println("testDupValues() end")
}



Lmdb code that constrains the size of MDB_DUPSORT values

If lmdb is configured with MDB_DUPSORT, then the size of a value associated with a key cannot exceed 511 bytes.  See the lmdb C code in mdb.c below.  The use of the macro ENV_MAXKEY() here is confusing because, despite its name, it constrains the size of the value, not the key.

  #if SIZE_MAX > MAXDATASIZE
  if (data->mv_size > ((mc->mc_db->md_flags & MDB_DUPSORT) ? ENV_MAXKEY(env) : MAXDATASIZE))
    return MDB_BAD_VALSIZE;
  #else
  if ((mc->mc_db->md_flags & MDB_DUPSORT) && data->mv_size > ENV_MAXKEY(env))
    return MDB_BAD_VALSIZE;
  #endif


Here is the ENV_MAXKEY() macro definition in mdb.c, where the maximum is hardcoded to 511 bytes.


  /**     @brief The max size of a key we can write, or 0 for computed max.
    *
    *      This macro should normally be left alone or set to 0.
    *      Note that a database with big keys or dupsort data cannot be
    *      reliably modified by a liblmdb which uses a smaller max.
    *      The default is 511 for backwards compat, or 0 when #MDB_DEVEL.
    *
    *      Other values are allowed, for backwards compat.  However:
    *      A value bigger than the computed max can break if you do not
    *      know what you are doing, and liblmdb <= 0.9.10 can break when
    *      modifying a DB with keys/dupsort data bigger than its max.
    *
    *      Data items in an #MDB_DUPSORT database are also limited to
    *      this size, since they're actually keys of a sub-DB.  Keys and
    *      #MDB_DUPSORT data items must fit on a node in a regular page.
    */
  #ifndef MDB_MAXKEYSIZE
  #define MDB_MAXKEYSIZE   ((MDB_DEVEL) ? 0 : 511)
  #endif
  /**     The maximum size of a key we can write to the environment. */
  #if MDB_MAXKEYSIZE
  #define ENV_MAXKEY(env) (MDB_MAXKEYSIZE)
  #else
  #define ENV_MAXKEY(env) ((env)->me_maxkey)
  #endif