Issues while iterating over self-loop edges in Apache Spark

Mladen Marović


While debugging some Apache Spark jobs that process data from a JanusGraph graph, I noticed some issues with self-loop edges (edges that connect a vertex to itself). The data is read using:

javaSparkContext.newAPIHadoopRDD(hadoopConfiguration(), CqlInputFormat.class, NullWritable.class, VertexWritable.class)

When I try to process all outbound edges of a single vertex using:

vertex.edges(Direction.OUT)

and that vertex has multiple self-loop edges with the same edge label, the iterator always returns only one such edge. Edges that are not self-loops are all returned as expected.

To give a specific example, if I have a vertex V0 with edges E1, E2, E3, E4, E5 that lead to vertices V1, V2, V3, V4, V5, the call vertex.edges(Direction.OUT) returns an iterator that iterates over all five edges. However, if I have a vertex V0 with edges E1, E2, E3 that lead to V1, V2, V3, and self-loop edges EL1, EL2, EL3, the iterator iterates over E1, E2, E3, and only one of (EL1, EL2, EL3), giving a total of four edges instead of the expected six.
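To make the counts in the second case concrete, here is a minimal, self-contained Java sketch. It does not use JanusGraph, TinkerPop, or Spark: the Edge class, the "link" label, and both helper methods are hypothetical names introduced only for illustration. The filter is an assumed model of the behavior I observe (at most one self-loop kept per edge label), not the actual deserializer code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SelfLoopCountSketch {

    // Hypothetical stand-in for an edge; not a JanusGraph or TinkerPop class.
    static final class Edge {
        final String id;
        final String label;
        final String outVertex;
        final String inVertex;

        Edge(String id, String label, String outVertex, String inVertex) {
            this.id = id;
            this.label = label;
            this.outVertex = outVertex;
            this.inVertex = inVertex;
        }

        // A self-loop starts and ends at the same vertex.
        boolean isSelfLoop() {
            return outVertex.equals(inVertex);
        }
    }

    // Outbound edges of V0 from the example: E1..E3 lead to V1..V3,
    // and EL1..EL3 lead back to V0. All share one (assumed) label.
    static List<Edge> exampleOutEdges() {
        List<Edge> edges = new ArrayList<>();
        for (int i = 1; i <= 3; i++) {
            edges.add(new Edge("E" + i, "link", "V0", "V" + i));
        }
        for (int i = 1; i <= 3; i++) {
            edges.add(new Edge("EL" + i, "link", "V0", "V0"));
        }
        return edges;
    }

    // Assumed model of the observed behavior: keep every non-loop edge,
    // but only the first self-loop seen for each edge label.
    static List<Edge> keepOneSelfLoopPerLabel(List<Edge> edges) {
        Set<String> seenSelfLoopLabels = new HashSet<>();
        List<Edge> kept = new ArrayList<>();
        for (Edge e : edges) {
            if (e.isSelfLoop() && !seenSelfLoopLabels.add(e.label)) {
                continue; // a self-loop with this label was already kept
            }
            kept.add(e);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Edge> all = exampleOutEdges();
        System.out.println("expected: " + all.size());                          // expected: 6
        System.out.println("observed: " + keepOneSelfLoopPerLabel(all).size()); // observed: 4
    }
}
```

The sketch only reproduces the arithmetic of the example (six edges stored, four returned); it says nothing about why the real code behaves this way.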

After further analysis, I came upon this commit:

which explicitly added code that skips deserializing multiple self-loop edges. The code from the linked commit is still present in org.janusgraph:janusgraph-hadoop:0.5.3 and seems to be the cause of this unexpected behavior.

My questions are as follows:

  1. What is the reason behind implementing the change from the given commit?
  2. Is there another way to iterate over all edges, including (possibly) multiple self-loop edges with the same edge label?

Kind regards,

Mladen Marović
