Issues while iterating over self-loop edges in Apache Spark


Mladen Marović
 

Thanks for the reply. I created an issue at: https://github.com/JanusGraph/janusgraph/issues/2669

Kind regards,

Mladen Marović


hadoopmarc@...
 

Hi Mladen,

Indeed, the self-loop logic you point to still exists in:
https://github.com/JanusGraph/janusgraph/blob/master/janusgraph-hadoop/src/main/java/org/janusgraph/hadoop/formats/util/JanusGraphVertexDeserializer.java

I guess the intent of the filtering of these self loop edges is to prevent that a single self-loop edge appears twice, as IN edge and as OUT edge. I also guess that the actual implementation is buggy: it is not the responsibility of the InputFormat to filter any data (your example!) but rather to represent all data present faithfully. Can you report an issue for this at https://github.com/JanusGraph/janusgraph/issues ?

This also means that there is not an easy way out, other than starting with a private build with a fix (and possibly contributing the fix as a PR).

Best wishes,    Marc


Mladen Marović
 

Hello,

while debugging some Apache Spark jobs that process data from a Janusgraph graph. i noticed some issues with self-loop edges (edges that connect a vertex to itself). The data is read using:

javaSparkContext.newAPIHadoopRDD(hadoopConfiguration(), CqlInputFormat.class, NullWritable.class, VertexWritable.class)

When I try to process all outbound edges of a single vertex using:

vertex.edges(Direction.OUT)

and that vertex has multiple self-loop edges with the same edge label, the iterator always returns only one such edge. Edges that are not self-loop are all returned as expected.

To give a specific example, if I have a vertex V0 with edges that E1, E2, E3, E4, E5 that lead to vertices V1, V2, V3, V4, V5, the call vertex.edges(Direction.OUT) will return an iterator that iterates over all five edges. However, if I have a vertex V0 with edges E1, E2, E3 that lead to V1, V2, V3, and self-loop edges EL1, EL2, EL3, the iterator will iterate over E1, E2, E3, and only one of (EL1, EL2, EL3), giving a total of four edges instead of the expected six.

After further analysis, I came upon this commit:

https://github.com/JanusGraph/janusgraph/commit/d3006dc939c1b640bb263806abd3fd6bee630d12

which explicitly added code that skips deserializing multiple self-loop edges. The code from the linked commit is still present in org.janusgraph:janusgraph-hadoop:0.5.3 and seems to be the cause of this unexpected behavior.

My questions are as follows:

  1. What is the reason behind implementing the change from the given commit?
  2. Is there another way to iterate on all edges, including (possibly) multiple self-loop edges with the same edge label?

Kind regards,

Mladen Marović