
I’m experiencing data loss when writing a large DataFrame to Redis using the Spark-Redis connector.

Details:

  • I have a DataFrame with millions of rows.

  • Writing to Redis works correctly for small DataFrames, but when the DataFrame is large, some rows seem to be missing after the write.

Observations:

  1. Reading back from Redis via the Spark-Redis connector returns fewer rows than the original DataFrame.

  2. Reading directly by key or using scan_iter also returns fewer entries.

  3. There are no duplicate rows in the DataFrame (see the key-uniqueness sketch after this list).

  4. This issue only happens with large datasets; small datasets are written correctly.
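
One sanity check worth adding here (a minimal sketch, assuming the same df and the pyspark.sql.functions alias F used in the example code below): rows that share a uid produce the same Redis key, so they overwrite each other and can look like dropped rows on read-back.

# Sketch: compare the row count with the number of distinct derived keys
total_rows = df.count()
distinct_keys = (
    df.select(F.concat(F.lit("{"), F.col("uid"), F.lit("}")).alias("key"))
      .distinct()
      .count()
)
print(f"rows: {total_rows}, distinct keys: {distinct_keys}")
# If distinct_keys < total_rows, the "missing" rows were overwritten under a
# shared key rather than dropped by the connector.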

Question:

  • Why does Spark-Redis drop rows when writing large DataFrames?

  • Are there any recommended settings, configurations, or approaches to reliably write large datasets to Redis using Spark-Redis?

Example Code

from pyspark.sql import functions as F

# Prepare Redis key column
df_to_redis = df.withColumn("key", F.concat(F.lit("{"), F.col("uid"), F.lit("}"))).select("key", "lang")

# Write to Redis
df_to_redis.write.format("org.apache.spark.sql.redis") \
    .option("table", "info") \
    .option("key.column", "key")
    .option("host", "REDIS_HOST") \
    .option("port", 6379) \
    .option("dbNum", 0) \
    .mode("append") \
    .save()

# Reading back from Redis using Spark-Redis
df_redis = spark.read.format("org.apache.spark.sql.redis") \
    .option("table", "info") \
    .option("host", "REDIS_HOST") \
    .option("port", 6379) \
    .option("dbNum", 0) \
    .load()

# Reading all keys directly from Redis using redis-py keys()
import redis

r = redis.Redis(host="REDIS_HOST", port=6379, db=0)
all_keys = r.keys("info:*")
print(f"Number of keys read via keys(): {len(all_keys)}")

# Reading all keys from Redis using scan_iter(), reusing the same connection
keys = list(r.scan_iter("info:*"))
print(f"Number of keys read via scan_iter: {len(keys)}")
  • Did you set the correct timeout? A connection timeout might have caused the write to fail before all the data was stored. I've seen a similar issue in this thread; check whether the solution there helps: stackoverflow.com/questions/67924823/… Commented Nov 19 at 5:07
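
Building on the timeout suggestion above, a hedged sketch of the same write with an explicit connection timeout and a smaller pipeline batch. The timeout (in milliseconds) and max.pipeline.size options are taken from the spark-redis documentation and should be verified against the connector version in use.

# Sketch only: option names and values are assumptions based on the spark-redis docs
df_to_redis.write.format("org.apache.spark.sql.redis") \
    .option("table", "info") \
    .option("key.column", "key") \
    .option("host", "REDIS_HOST") \
    .option("port", 6379) \
    .option("dbNum", 0) \
    .option("timeout", 20000) \
    .option("max.pipeline.size", 1000) \
    .mode("append") \
    .save()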

1 Answer

I’ve seen this issue before, and it wasn’t caused by a write failure or data loss in the connector. The real problem was that Redis had run out of memory.
You can check this using the INFO command; example code is below.

If used_memory_human and maxmemory_human show the same value, it means Redis has reached its memory limit and can’t store any additional data.

import redis

# Connect to Redis
r = redis.Redis(
    host="your-redis-host",  # adjust to your environment
    port=6379,
    db=0
)

# Fetch memory usage
memory_info = r.info("memory")
usage = {
    "used_memory_human": memory_info.get("used_memory_human"),
    "maxmemory_human": memory_info.get("maxmemory_human")
}

print(usage)
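
If memory really is the limit, it is also worth checking whether Redis has been evicting keys. A short follow-up sketch using the same connection:

# A non-zero evicted_keys counter combined with an eviction policy such as
# allkeys-lru means Redis silently removed keys once maxmemory was reached.
stats = r.info("stats")
policy = r.config_get("maxmemory-policy")
print("evicted_keys:", stats.get("evicted_keys"))
print("maxmemory-policy:", policy.get("maxmemory-policy"))

With the default noeviction policy, Redis instead rejects new writes once maxmemory is reached, which would show up as errors on the Spark side rather than silently missing keys.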
