Re: [BLOG] Configuring JanusGraph for spark-yarn

liuzhip...@...
 

Hey Joseph, did your test succeed? Can you share your experience with me? Thanks.

On Tuesday, August 15, 2017 at 6:17:12 AM UTC+8, Joseph Obernberger wrote:

Marc - thank you for this.  I'm going to try getting the latest version of JanusGraph, and compiling it with our specific version of Cloudera CDH, then run some tests.  Will report back.

-Joe


On 8/13/2017 4:07 PM, HadoopMarc wrote:

Hi Joe,

To shed some more light on the running figures you presented, I ran some tests on my own cluster:

1. I loaded the default janusgraph-hbase table with the following simple script from the console:

// load m batches of n vertex pairs, with two edges per pair and one commit per batch
graph = JanusGraphFactory.open("conf/janusgraph-hbase.properties")
g = graph.traversal()
m = 1200L
n = 10000L
(0L..<m).each{
        (0L..<n).each{
                v1 = g.addV().id().next()
                v2 = g.addV().id().next()
                g.V(v1).addE('link1').to(g.V(v2)).next()
                g.V(v1).addE('link2').to(g.V(v2)).next()
        }
        g.tx().commit()  // commit once per batch of n vertex pairs
}

This script runs for about 20(?) minutes and results in 24M vertices and 24M edges committed to the graph.

2. I did an OLTP g.V().count() on this graph from the console: 11 minutes the first time, 10 minutes the second time

3. I ran OLAP jobs on this graph using janusgraph-hhbase in two ways:
    a) with g = graph.traversal().withComputer(SparkGraphComputer)  
    b) with g = graph.traversal().withComputer(new Computer().graphComputer(SparkGraphComputer).workers(10))

the properties file was as in the recipe, with the exception of:
   spark.executor.memory=4096m       # smaller values might work, but the 512m from the recipe is definitely too small
   spark.executor.instances=4
   #spark.executor.cores not set, so default value 1

This resulted in the following running times:
   a) stage 0,1,2 => 12min, 12min, 3s => 24min total
   b) stage 0,1,2 => 18min, 1min, 86ms => 19min total

Discussion:
  • HBase is not an easy source for OLAP: HBase wants large regions for efficiency (configurable, but typically 2-20GB), while MapReduce input formats (like JanusGraph's HBaseInputFormat) take regions as input splits by default. This means that only a few executors will read from HBase unless the HBaseInputFormat is extended to split a region's keyspace into multiple input splits (see the sketch after this list). This mismatch between the number of regions and the number of Spark executors is a potential JanusGraph issue. Examples exist to improve on this, e.g. org.apache.hadoop.hbase.mapreduce.RowCounter

  • For Spark stages after stage 0 (reading from HBase), increasing the number of Spark tasks with the "workers()" setting helps optimize the parallelization. This means that for larger traversals than just a vertex count, the parallelization with Spark will really pay off.

  • I did not try to repeat your settings with a large number of cores. Various sources discourage the use of spark.executor.cores values larger than 5, e.g. https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/, https://stackoverflow.com/questions/37871194/how-to-tune-spark-executor-number-cores-and-executor-memory
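As an illustration of the input-split remark in the first bullet, here is a minimal, untested sketch of how each region could be bisected into two input splits. The class name SplitDoublingInputFormat and the use of HBase's TableInputFormat are my own illustration; JanusGraph's HBaseInputFormat would need the equivalent treatment:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableSplit;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;

// Hypothetical input format that turns every region (one split by default) into
// two splits, so twice as many Spark executors can read from HBase in stage 0.
public class SplitDoublingInputFormat extends TableInputFormat {
    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
        List<InputSplit> splits = new ArrayList<>();
        for (InputSplit split : super.getSplits(context)) {
            TableSplit ts = (TableSplit) split;
            // Midpoint of the region's key range; the empty start/end rows of the
            // first and last region need extra handling, omitted here.
            byte[] mid = Bytes.split(ts.getStartRow(), ts.getEndRow(), 1)[1];
            splits.add(new TableSplit(ts.getTable(), ts.getStartRow(), mid, ts.getRegionLocation()));
            splits.add(new TableSplit(ts.getTable(), mid, ts.getEndRow(), ts.getRegionLocation()));
        }
        return splits;
    }
}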
Hopefully, these tests provide you and other readers with some additional perspectives on the configuration of janusgraph-hbase.

Cheers,    Marc

On Thursday, August 10, 2017 at 15:40:21 UTC+2, Joseph Obernberger wrote:

Thank you Marc.

I did not set spark.executor.instances, but I do have spark.cores.max set to 64, and YARN is configured to allow as much RAM/cores as possible for our 5-server cluster.  When I run a job on a table that has 61 regions, I see that 43 tasks are started, running on all 5 nodes in the Spark UI (and confirmed by running top on each of the servers).  If I lower the amount of RAM (heap) that each task has (currently set to 10G), they fail with OutOfMemory exceptions.  It still hits one HBase node very hard and cycles through them.  While that may be a reason for a performance issue, it doesn't explain the massive number of calls that HBase receives for a count job, and why using SparkGraphComputer takes so much more time.

Running with your command below appears to not alter the behavior.  I did run a job last night with DEBUG turned on, but it produced too much logging, filling up the log directory on 3 of the 5 nodes before stopping.
Thanks again Marc!

-Joe


On 8/10/2017 7:33 AM, HadoopMarc wrote:
Hi Joe,

Another thing to try (only tested on Tinkerpop, not on JanusGraph): create the traversalsource as follows:

g = graph.traversal().withComputer(new Computer().graphComputer(SparkGraphComputer).workers(100))

With HadoopGraph this helps split HDFS files with very large or no partitions across tasks; I did not check the effect yet for the HBaseInputFormat in JanusGraph. And did you add spark.executor.instances=10 (or some suitable number) to your config? And did you check in the RM UI or Spark history server whether these executors were really allocated and started?
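For example, a minimal sketch of the relevant fragment of the properties file (the file name and exact value are illustrative):

# illustrative fragment of e.g. conf/hadoop-graph/read-hbase.properties
spark.master=yarn-client
spark.executor.instances=10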

More later,

Marc

On Thursday, August 10, 2017 at 00:13:09 UTC+2, Joseph Obernberger wrote:

Marc - thank you.  I've updated the classpath and removed nearly all of the CDH jars; had to keep chimera and some of the HBase libs in there.  Apart from those and all the jars in lib.zip, it is working as it did before.  The reason I turned DEBUG off was because it was producing 100+GBytes of logs.  Nearly all of which are things like:

18:04:29 DEBUG org.janusgraph.diskstorage.hbase.HBaseKeyColumnValueStore - Generated HBase Filter ColumnRangeFilter [\x10\xC0, \x10\xC1)
18:04:29 DEBUG org.janusgraph.graphdb.transaction.StandardJanusGraphTx - Guava vertex cache size: requested=20000 effective=20000 (min=100)
18:04:29 DEBUG org.janusgraph.graphdb.transaction.vertexcache.GuavaVertexCache - Created dirty vertex map with initial size 32
18:04:29 DEBUG org.janusgraph.graphdb.transaction.vertexcache.GuavaVertexCache - Created vertex cache with max size 20000
18:04:29 DEBUG org.janusgraph.diskstorage.hbase.HBaseKeyColumnValueStore - Generated HBase Filter ColumnRangeFilter [\x10\xC2, \x10\xC3)
18:04:29 DEBUG org.janusgraph.graphdb.transaction.StandardJanusGraphTx - Guava vertex cache size: requested=20000 effective=20000 (min=100)
18:04:29 DEBUG org.janusgraph.graphdb.transaction.vertexcache.GuavaVertexCache - Created dirty vertex map with initial size 32
18:04:29 DEBUG org.janusgraph.graphdb.transaction.vertexcache.GuavaVertexCache - Created vertex cache with max size 20000

Do those mean anything to you?  I've turned it back on for running with smaller graph sizes, but so far I don't see anything helpful there apart from an exception about not setting HADOOP_HOME.
Here are the spark properties; notice the nice and small extraClassPath!  :)

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.deriveMemory=false
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.hbase.HBaseInputFormat
gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.graphWriter.hasEdges=false
gremlin.hadoop.inputLocation=none
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.memoryOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.outputLocation=output
janusgraphmr.ioformat.conf.storage.backend=hbase
janusgraphmr.ioformat.conf.storage.hbase.region-count=5
janusgraphmr.ioformat.conf.storage.hbase.regions-per-server=5
janusgraphmr.ioformat.conf.storage.hbase.short-cf-names=false
janusgraphmr.ioformat.conf.storage.hbase.table=TEST0.2.0
janusgraphmr.ioformat.conf.storage.hostname=10.22.5.65:2181
log4j.appender.STDOUT=org.apache.log4j.ConsoleAppender
log4j.logger.deng=WARNING
log4j.rootLogger=STDOUT
org.slf4j.simpleLogger.defaultLogLevel=warn
spark.akka.frameSize=1024
spark.app.id=application_1502118729859_0041
spark.app.name=Apache TinkerPop's Spark-Gremlin
spark.authenticate=false
spark.cores.max=64
spark.driver.appUIAddress=http://10.22.5.61:4040
spark.driver.extraJavaOptons=-XX:ReservedCodeCacheSize=100M -XX:MaxMetaspaceSize=256m -XX:CompressedClassSpaceSize=256m
spark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/hadoop/lib/native
spark.driver.host=10.22.5.61
spark.driver.port=38529
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.executorIdleTimeout=60
spark.dynamicAllocation.minExecutors=0
spark.dynamicAllocation.schedulerBacklogTimeout=1
spark.eventLog.dir=hdfs://host001:8020/user/spark/applicationHistory
spark.eventLog.enabled=true
spark.executor.extraClassPath=/opt/cloudera/parcels/CDH/jars/janusgraph-hbase-0.2.0-SNAPSHOT.jar:./lib.zip/*:/opt/cloudera/parcels/CDH/lib/hbase/bin/../lib/*:/etc/hbase/conf:
spark.executor.extraJavaOptions=-XX:ReservedCodeCacheSize=100M -XX:MaxMetaspaceSize=256m -XX:CompressedClassSpaceSize=256m -Dlogback.configurationFile=logback.xml
spark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/hadoop/lib/native
spark.executor.heartbeatInterval=100000
spark.executor.id=driver
spark.executor.memory=10240m
spark.externalBlockStore.folderName=spark-27dac3f3-dfbc-4f32-b52d-ececdbcae0db
spark.kyroserializer.buffer.max=1600m
spark.master=yarn-client
spark.network.timeout=90000
spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS=host005
spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES=http://host005:8088/proxy/application_1502118729859_0041
spark.scheduler.mode=FIFO
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.shuffle.service.enabled=true
spark.shuffle.service.port=7337
spark.ui.filters=org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
spark.ui.killEnabled=true
spark.yarn.am.extraLibraryPath=/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/hadoop/lib/native
spark.yarn.appMasterEnv.CLASSPATH=/etc/haddop/conf:/etc/hbase/conf:./lib.zip/*
spark.yarn.config.gatewayPath=/opt/cloudera/parcels
spark.yarn.config.replacementPath={{HADOOP_COMMON_HOME}}/../../..
spark.yarn.dist.archives=/home/graph/janusgraph-0.2.0-SNAPSHOT-hadoop2.JOE/lib.zip
spark.yarn.dist.files=/home/graph/janusgraph-0.2.0-SNAPSHOT-hadoop2.JOE/conf/logback.xml
spark.yarn.dist.jars=/opt/cloudera/parcels/CDH/jars/janusgraph-hbase-0.2.0-SNAPSHOT.jar
spark.yarn.historyServer.address=http://host001:18088
zookeeper.znode.parent=/hbase


-Joe

On 8/9/2017 3:33 PM, HadoopMarc wrote:
Hi Gari and Joe,

Glad to see you testing the recipes for MapR and Cloudera respectively!  I am sure that you have realized by now that getting this to work is like walking through a minefield. If you deviate from the known path, the odds of getting through are slim, and no one wants to be in your vicinity. So, if you see a need to deviate (which there may be for the Hadoop distributions you use), you will need your mine sweeper, that is, set the logging level to DEBUG for the relevant Java packages.

This is where you deviated:
  • for Gari: you put all kinds of MapR lib folders on the application master's classpath (other classpath configs are not visible from your post)
  • for Joe: you put all kinds of Cloudera lib folders on the executors' classpath (worst of all, the spark-assembly jar)

Probably, you are experiencing all kinds of mismatches in netty libraries, which slow down or even kill all comms between the YARN containers. The philosophy of the recipes really is to add only the minimum number of conf folders and jars to the TinkerPop/JanusGraph distribution and see from there if any libraries are missing.


On my side, it has become apparent that I should at least add to the recipes:

  • proof of work for a medium-sized graph (say 10M vertices and edges)
  • configs for the number of executors present in the OLAP job (instead of relying on Spark's default of 2)

So, still some work to do!


Cheers,    Marc




Re: What's wrong with this code? It throws NoSuchElementException when I try to add an Edge?

Jason Plurad <plu...@...>
 

Double check your usage of "propId" for the "b" vertex:

Vertex creation:

g.addV().property(String.format("propId", cols[5]), cols[3])

Traversal:

V().has("propId", cols[3])



On Sunday, August 20, 2017 at 2:41:59 PM UTC-4, 刑天 wrote:



package com.sankuai.kg;

import java.io.File;
import java.util.Iterator;

import org.apache.commons.io.FileUtils;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversal;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.structure.Transaction;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.apache.tinkerpop.gremlin.structure.util.empty.EmptyGraph;

public class Loader {

    public static void main(String[] args) throws Exception {
        Graph graph = EmptyGraph.instance();
        GraphTraversalSource g = graph.traversal().withRemote("remote-graph.properties");
        Iterator<String> lineIt = FileUtils.lineIterator(new File(args[0]));
        while (lineIt.hasNext()) {
            String line = lineIt.next();
            String[] cols = line.split(",");

            GraphTraversal<Vertex, Vertex> t1 = g.V().has("poiId", cols[0]);
            GraphTraversal<Vertex, Vertex> t2 = g.V().has("poiId", cols[3]);

            if (!t1.hasNext())
                g.addV().property("poiId", cols[0]).property("name", cols[1]).property("type", cols[2]).next();
            if (!t2.hasNext())
                g.addV().property(String.format("propId", cols[5]), cols[3]).property("name", cols[4])
                        .property("type", cols[5]).next();
            g.V().has("poiId", cols[0]).as("a").V().has("propId", cols[3]).as("b").addE(cols[6])
                    .from("a").to("b").next();
        }

        g.close();
    }
}



What's wrong with this code? It throws NoSuchElementException when I try to add an Edge?

刑天 <gaoxtw...@...>
 




package com.sankuai.kg;

import java.io.File;
import java.util.Iterator;

import org.apache.commons.io.FileUtils;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversal;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.structure.Transaction;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.apache.tinkerpop.gremlin.structure.util.empty.EmptyGraph;

public class Loader {

    public static void main(String[] args) throws Exception {
        Graph graph = EmptyGraph.instance();
        GraphTraversalSource g = graph.traversal().withRemote("remote-graph.properties");
        Iterator<String> lineIt = FileUtils.lineIterator(new File(args[0]));
        while (lineIt.hasNext()) {
            String line = lineIt.next();
            String[] cols = line.split(",");

            GraphTraversal<Vertex, Vertex> t1 = g.V().has("poiId", cols[0]);
            GraphTraversal<Vertex, Vertex> t2 = g.V().has("poiId", cols[3]);

            if (!t1.hasNext())
                g.addV().property("poiId", cols[0]).property("name", cols[1]).property("type", cols[2]).next();
            if (!t2.hasNext())
                g.addV().property(String.format("propId", cols[5]), cols[3]).property("name", cols[4])
                        .property("type", cols[5]).next();
            g.V().has("poiId", cols[0]).as("a").V().has("propId", cols[3]).as("b").addE(cols[6])
                    .from("a").to("b").next();
        }

        g.close();
    }
}



Re: Performance issues on a laptop.

Ray Scott <raya...@...>
 

I've just managed to connect to my standalone Cassandra 3.11 and can work from the gremlin shell with no visible hit on the CPUs, so I'm happy with that. In the past, I think I've missed the notice about running "nodetool enablethrift" and would get an error regarding AstyanaxStoreManager.

This is the setup I need going forward anyway, as I plan to websocket into Gremlin Server from microservices.

I still have no idea why running janusgraph.sh causes its Cassandra to go berserk.

 

On Friday, 18 August 2017 18:56:56 UTC+1, Robert Dale wrote:
Maybe search to see if there's a known issue with running Cassandra on a MacBook. You could upgrade Cassandra if that's what you need. I believe JanusGraph is known to work with all current versions.

Robert Dale

On Fri, Aug 18, 2017 at 12:06 PM, 'Ray Scott' via JanusGraph users list <janusgra...@googlegroups.com> wrote:
As soon as I killed the cassandra process, the CPU usage plummeted. So at least I know who the culprit was. I'll just start Gremlin Server directly (configured to use BDB), instead of using janusgraph.sh.


On Thursday, 17 August 2017 19:22:51 UTC+1, Robert Dale wrote:
For how long does the cpu remain high after you get the `gremlin>` prompt?

Robert Dale

On Thu, Aug 17, 2017 at 2:20 PM, 'Ray Scott' via JanusGraph users list <janu...@...> wrote:
Hi, 

I'm trying to get JanusGraph running on my laptop (MacBook Air 2 Core Intel i7, 8GB) so that I can develop a small working prototype. 

I've gone for the janusgraph.sh method of starting the server and everything is fine until I open a gremlin shell. Then I see the CPU usage for my terminal rocket up past 350%. Once I close the gremlin shell, the CPU usage remains at the same high level, indefinitely. I've tried launching with the Berkeley DB option and no Elastic Search, but I get the exact same behaviour. 

Is there something I can do to stop this from using most of my CPU, or do I just have to live with it? 

Is there a "lite" recommended setup that someone has had success with in the past? 

Thanks. 



Re: Spark connector

HadoopMarc <bi...@...>
 

Hi Takao,

JanusGraph reads data from distributed backends into Hadoop using its HBaseInputFormat and CassandraInputFormat classes (which are descendants of org.apache.hadoop.mapreduce.InputFormat). Therefore, it seems possible to directly access graphs in these backends from Spark using sc.newAPIHadoopRDD. AFAIK, this particular use of the input formats is nowhere documented or demonstrated, though.
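For the HBase case, a minimal, untested sketch of that idea could look as follows (the janusgraphmr.* values are illustrative and must match your cluster; each record is one star vertex with its properties and incident edges):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.NullWritable;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.tinkerpop.gremlin.hadoop.structure.io.VertexWritable;
import org.janusgraph.hadoop.formats.hbase.HBaseInputFormat;

public class DirectGraphRDD {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("janusgraph-direct-read"));

        // Same janusgraphmr.* keys that SparkGraphComputer hands to the input format
        Configuration conf = new Configuration();
        conf.set("janusgraphmr.ioformat.conf.storage.backend", "hbase");
        conf.set("janusgraphmr.ioformat.conf.storage.hostname", "zookeeper-host");  // illustrative
        conf.set("janusgraphmr.ioformat.conf.storage.hbase.table", "janusgraph");   // illustrative

        JavaPairRDD<NullWritable, VertexWritable> graphRDD = sc.newAPIHadoopRDD(
                conf, HBaseInputFormat.class, NullWritable.class, VertexWritable.class);
        System.out.println("vertex count: " + graphRDD.count());
        sc.stop();
    }
}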

My earlier answer effectively came down to storing the graph to HDFS using an OutputRDD class for the gremlin.hadoop.graphWriter property and Spark serialization (my earlier suggestion of persisting the graphRDD using PersistedOutputRDD would not work for you, because Python and Gremlin Server would not share the same SparkContext). This may or may not be easier or more efficient than writing your own CSV input/output routines (in combination with the BulkDumperVertexProgram to parallelize the writing).

Hope this helps,

Marc



On Friday, August 18, 2017 at 04:19:33 UTC+2, Takao Magoori wrote:

Hi Marc,

Thank you!
But I don't understand what you mean, sorry.
I feel SparkGraphComputer is "OLAP by Gremlin on top of Spark's distributed power", but I want "OLAP by Spark using JanusGraph data".

So, I want to run "spark-submit", create a PySpark SparkContext, and load the JanusGraph data into a DataFrame. Then I can use Spark DataFrames, Spark ML, and Python machine-learning packages.
The following pseudo-code is what I really want (like https://github.com/sbcd90/spark-orientdb).
If there is no such solution, I guess I have to dump the whole graph into CSV and read it from PySpark.

--------
spark_session = SparkSession.builder.appName('test').enableHiveSupport().getOrCreate()

df_user = spark_session.read.format(
    'org.apache.janusgraph.some_spark_gremlin_connector',
).options(
    url='url',
    query='g.V().hasLabel("user").has("age", gt(29)).valueMap("user_id", "name" "age")',
).load().dropna().join(
    other=some_df,
)


df_item = spark_session.read.format(
    'org.apache.janusgraph.some_spark_gremlin_connector',
).options(
    url='url',
    query='g.V().hasLabel("user").has("age", gt(29)).out("buy").hasLabel("item").valueMap("item_id", "name")',
).load().dropna()


df_sale = spark_session.read.format(
    'org.apache.janusgraph.some_spark_gremlin_connector',
).options(
    url='url',
    query='g.V().hasLabel("user").has("age", gt(29)).outE("buy").valueMap("timestamp")',
).load().select(
    col('item_id'),
    col('name'),
).dropna()
--------


On Friday, August 18, 2017 at 4:08:02 AM UTC+9, HadoopMarc wrote:
Hi Takao,

Only some directions. If you combine:

http://yaaics.blogspot.nl/              (using CassandraInputFormat in your case)
http://tinkerpop.apache.org/docs/current/reference/#interacting-with-spark

it should be possible to access the PersistedInputRDD, a.k.a. the graphRDD, from the Spark object.
I have never done this myself, so I would be interested to read whether this works! Probably you will need to run an OLAP query with SparkGraphComputer anyway (e.g. g.V()) to have the PersistedInputRDD realized (RDDs are not realized until a Spark action is run on them).

Cheers,     Marc


On Thursday, August 17, 2017 at 16:25:42 UTC+2, Takao Magoori wrote:
I have a JanusGraph Server (github master, gremlin 3.2.5) on top of a Cassandra storage backend, to store users, items, and "WHEN, WHERE, WHO bought WHAT?" relations.
To get data from and modify data in the graph, I use the Python aiogremlin driver mode (== Groovy sessionless eval mode) and it works well for now. Thanks, developers!

So now, I have to compute recommendations and forecast item sales.
For data cleaning, data normalization, recommendation, and forecasting on a somewhat large graph, I want to use higher-level PySpark tools (e.g. DataFrame, ML) and Python machine-learning packages (e.g. scikit-learn). But I cannot find a way to load graph data into Spark. What I want is a "connector" which can be used from PySpark to load data from JanusGraph, not SparkGraphComputer.

Could someone please explain how to do this?


- Additional info
It seems OrientDB has some Spark connectors (though I don't know whether they can be used from PySpark). But I want one for JanusGraph.


Re: Performance issues on a laptop.

Kevin Schmidt <ktsc...@...>
 

Sorry, I should have been more explicit: I've done that on Mac OS X with no problems and no maxed-out CPU.

On Fri, Aug 18, 2017 at 11:18 AM, Kevin Schmidt <ktsc...@...> wrote:
I had used Cassandra 2.x and 3.0.9 with Titan with no issues.

On Fri, Aug 18, 2017 at 10:56 AM, Robert Dale <rob...@...> wrote:
Maybe search to see if there's a known issue with running Cassandra on a MacBook. You could upgrade Cassandra if that's what you need. I believe JanusGraph is known to work with all current versions.

Robert Dale

On Fri, Aug 18, 2017 at 12:06 PM, 'Ray Scott' via JanusGraph users list <janusgraph-users@googlegroups.com> wrote:
As soon as I killed the cassandra process, the CPU usage plummeted. So at least I know who the culprit was. I'll just start Gremlin Server directly (configured to use BDB), instead of using janusgraph.sh.


On Thursday, 17 August 2017 19:22:51 UTC+1, Robert Dale wrote:
For how long does the cpu remain high after you get the `gremlin>` prompt?

Robert Dale

On Thu, Aug 17, 2017 at 2:20 PM, 'Ray Scott' via JanusGraph users list <janu...@...> wrote:
Hi, 

I'm trying to get JanusGraph running on my laptop (MacBook Air 2 Core Intel i7, 8GB) so that I can develop a small working prototype. 

I've gone for the janusgraph.sh method of starting the server and everything is fine until I open a gremlin shell. Then I see the CPU usage for my terminal rocket up past 350%. Once I close the gremlin shell, the CPU usage remains at the same high level, indefinitely. I've tried launching with the Berkeley DB option and no Elastic Search, but I get the exact same behaviour. 

Is there something I can do to stop this from using most of my CPU, or do I just have to live with it? 

Is there a "lite" recommended setup that someone has had success with in the past? 

Thanks. 




Re: Performance issues on a laptop.

Kevin Schmidt <ktsc...@...>
 

I had used Cassandra 2.x and 3.0.9 with Titan with no issues.

On Fri, Aug 18, 2017 at 10:56 AM, Robert Dale <rob...@...> wrote:
Maybe search to see if there's a known issue with running Cassandra on a MacBook. You could upgrade Cassandra if that's what you need. I believe JanusGraph is known to work with all current versions.

Robert Dale

On Fri, Aug 18, 2017 at 12:06 PM, 'Ray Scott' via JanusGraph users list <janusgraph-users@googlegroups.com> wrote:
As soon as I killed the cassandra process, the CPU usage plummeted. So at least I know who the culprit was. I'll just start Gremlin Server directly (configured to use BDB), instead of using janusgraph.sh.


On Thursday, 17 August 2017 19:22:51 UTC+1, Robert Dale wrote:
For how long does the cpu remain high after you get the `gremlin>` prompt?

Robert Dale

On Thu, Aug 17, 2017 at 2:20 PM, 'Ray Scott' via JanusGraph users list <janu...@...> wrote:
Hi, 

I'm trying to get JanusGraph running on my laptop (MacBook Air 2 Core Intel i7, 8GB) so that I can develop a small working prototype. 

I've gone for the janusgraph.sh method of starting the server and everything is fine until I open a gremlin shell. Then I see the CPU usage for my terminal rocket up past 350%. Once I close the gremlin shell, the CPU usage remains at the same high level, indefinitely. I've tried launching with the Berkeley DB option and no Elastic Search, but I get the exact same behaviour. 

Is there something I can do to stop this from using most of my CPU, or do I just have to live with it? 

Is there a "lite" recommended setup that someone has had success with in the past? 

Thanks. 



Re: Performance issues on a laptop.

Robert Dale <rob...@...>
 

Maybe search to see if there's a known issue with running Cassandra on a MacBook. You could upgrade Cassandra if that's what you need. I believe JanusGraph is known to work with all current versions.

Robert Dale

On Fri, Aug 18, 2017 at 12:06 PM, 'Ray Scott' via JanusGraph users list <janusgra...@...> wrote:
As soon as I killed the cassandra process, the CPU usage plummeted. So at least I know who the culprit was. I'll just start Gremlin Server directly (configured to use BDB), instead of using janusgraph.sh.


On Thursday, 17 August 2017 19:22:51 UTC+1, Robert Dale wrote:
For how long does the cpu remain high after you get the `gremlin>` prompt?

Robert Dale

On Thu, Aug 17, 2017 at 2:20 PM, 'Ray Scott' via JanusGraph users list <janu...@...> wrote:
Hi, 

I'm trying to get JanusGraph running on my laptop (MacBook Air 2 Core Intel i7, 8GB) so that I can develop a small working prototype. 

I've gone for the janusgraph.sh method of starting the server and everything is fine until I open a gremlin shell. Then I see the CPU usage for my terminal rocket up past 350%. Once I close the gremlin shell, the CPU usage remains at the same high level, indefinitely. I've tried launching with the Berkeley DB option and no Elastic Search, but I get the exact same behaviour. 

Is there something I can do to stop this from using most of my CPU, or do I just have to live with it? 

Is there a "lite" recommended setup that someone has had success with in the past? 

Thanks. 



Re: Performance issues on a laptop.

Ray Scott <raya...@...>
 

As soon as I killed the cassandra process, the CPU usage plummeted. So at least I know who the culprit was. I'll just start Gremlin Server directly (configured to use BDB), instead of using janusgraph.sh.


On Thursday, 17 August 2017 19:22:51 UTC+1, Robert Dale wrote:
For how long does the cpu remain high after you get the `gremlin>` prompt?

Robert Dale

On Thu, Aug 17, 2017 at 2:20 PM, 'Ray Scott' via JanusGraph users list <janusgra...@googlegroups.com> wrote:
Hi, 

I'm trying to get JanusGraph running on my laptop (MacBook Air 2 Core Intel i7, 8GB) so that I can develop a small working prototype. 

I've gone for the janusgraph.sh method of starting the server and everything is fine until I open a gremlin shell. Then I see the CPU usage for my terminal rocket up past 350%. Once I close the gremlin shell, the CPU usage remains at the same high level, indefinitely. I've tried launching with the Berkeley DB option and no Elastic Search, but I get the exact same behaviour. 

Is there something I can do to stop this from using most of my CPU, or do I just have to live with it? 

Is there a "lite" recommended setup that someone has had success with in the past? 

Thanks. 



Re: Spark connector

Takao Magoori <ma...@...>
 

Hi Marc,

Thank you!
But I don't understand what you mean, sorry.
I feel SparkGraphComputer is "OLAP by Gremlin on top of Spark's distributed power", but I want "OLAP by Spark using JanusGraph data".

So, I want to run "spark-submit", create a PySpark SparkContext, and load the JanusGraph data into a DataFrame. Then I can use Spark DataFrames, Spark ML, and Python machine-learning packages.
The following pseudo-code is what I really want (like https://github.com/sbcd90/spark-orientdb).
If there is no such solution, I guess I have to dump the whole graph into CSV and read it from PySpark.

--------
spark_session = SparkSession.builder.appName('test').enableHiveSupport().getOrCreate()

df_user = spark_session.read.format(
    'org.apache.janusgraph.some_spark_gremlin_connector',
).options(
    url='url',
    query='g.V().hasLabel("user").has("age", gt(29)).valueMap("user_id", "name" "age")',
).load().dropna().join(
    other=some_df,
)


df_item = spark_session.read.format(
    'org.apache.janusgraph.some_spark_gremlin_connector',
).options(
    url='url',
    query='g.V().hasLabel("user").has("age", gt(29)).out("buy").hasLabel("item").valueMap("item_id", "name")',
).load().dropna()


df_sale = spark_session.read.format(
    'org.apache.janusgraph.some_spark_gremlin_connector',
).options(
    url='url',
    query='g.V().hasLabel("user").has("age", gt(29)).outE("buy").valueMap("timestamp")',
).load().select(
    col('item_id'),
    col('name'),
).dropna()
--------


On Friday, August 18, 2017 at 4:08:02 AM UTC+9, HadoopMarc wrote:

Hi Takao,

Only some directions. If you combine:

http://yaaics.blogspot.nl/              (using CassandraInputFormat in your case)
http://tinkerpop.apache.org/docs/current/reference/#interacting-with-spark

it should be possible to access the PersistedInputRDD, a.k.a. the graphRDD, from the Spark object.
I have never done this myself, so I would be interested to read whether this works! Probably you will need to run an OLAP query with SparkGraphComputer anyway (e.g. g.V()) to have the PersistedInputRDD realized (RDDs are not realized until a Spark action is run on them).

Cheers,     Marc


On Thursday, August 17, 2017 at 16:25:42 UTC+2, Takao Magoori wrote:
I have a JanusGraph Server (github master, gremlin 3.2.5) on top of a Cassandra storage backend, to store users, items, and "WHEN, WHERE, WHO bought WHAT?" relations.
To get data from and modify data in the graph, I use the Python aiogremlin driver mode (== Groovy sessionless eval mode) and it works well for now. Thanks, developers!

So now, I have to compute recommendations and forecast item sales.
For data cleaning, data normalization, recommendation, and forecasting on a somewhat large graph, I want to use higher-level PySpark tools (e.g. DataFrame, ML) and Python machine-learning packages (e.g. scikit-learn). But I cannot find a way to load graph data into Spark. What I want is a "connector" which can be used from PySpark to load data from JanusGraph, not SparkGraphComputer.

Could someone please explain how to do this?


- Additional info
It seems OrientDB has some Spark connectors (though I don't know whether they can be used from PySpark). But I want one for JanusGraph.


Re: Spark connector

HadoopMarc <bi...@...>
 

Hi Takao,

Only some directions. If you combine:

http://yaaics.blogspot.nl/              (using CassandraInputFormat in your case)
http://tinkerpop.apache.org/docs/current/reference/#interacting-with-spark

it should be possible to access the PersistedInputRDD, a.k.a. the graphRDD, from the Spark object.
I have never done this myself, so I would be interested to read whether this works! Probably you will need to run an OLAP query with SparkGraphComputer anyway (e.g. g.V()) to have the PersistedInputRDD realized (RDDs are not realized until a Spark action is run on them).

Cheers,     Marc


On Thursday, August 17, 2017 at 16:25:42 UTC+2, Takao Magoori wrote:

I have a JanusGraph Server (github master, gremlin 3.2.5) on top of a Cassandra storage backend, to store users, items, and "WHEN, WHERE, WHO bought WHAT?" relations.
To get data from and modify data in the graph, I use the Python aiogremlin driver mode (== Groovy sessionless eval mode) and it works well for now. Thanks, developers!

So now, I have to compute recommendations and forecast item sales.
For data cleaning, data normalization, recommendation, and forecasting on a somewhat large graph, I want to use higher-level PySpark tools (e.g. DataFrame, ML) and Python machine-learning packages (e.g. scikit-learn). But I cannot find a way to load graph data into Spark. What I want is a "connector" which can be used from PySpark to load data from JanusGraph, not SparkGraphComputer.

Could someone please explain how to do this?


- Additional info
It seems OrientDB has some Spark connectors (though I don't know whether they can be used from PySpark). But I want one for JanusGraph.


Re: Performance issues on a laptop.

Ray Scott <raya...@...>
 

Actually, it looks like it's doing it right now, after starting up the server and without even opening a gremlin shell. It's been high for 5 minutes, but I've not really timed it to see how long it continues. My machine starts to cook, so I stop the server.


On Thursday, 17 August 2017 19:22:51 UTC+1, Robert Dale wrote:
For how long does the cpu remain high after you get the `gremlin>` prompt?

Robert Dale

On Thu, Aug 17, 2017 at 2:20 PM, 'Ray Scott' via JanusGraph users list <janusgra...@googlegroups.com> wrote:
Hi, 

I'm trying to get JanusGraph running on my laptop (MacBook Air 2 Core Intel i7, 8GB) so that I can develop a small working prototype. 

I've gone for the janusgraph.sh method of starting the server and everything is fine until I open a gremlin shell. Then I see the CPU usage for my terminal rocket up past 350%. Once I close the gremlin shell, the CPU usage remains at the same high level, indefinitely. I've tried launching with the Berkeley DB option and no Elastic Search, but I get the exact same behaviour. 

Is there something I can do to stop this from using most of my CPU, or do I just have to live with it? 

Is there a "lite" recommended setup that someone has had success with in the past? 

Thanks. 



Re: Performance issues on a laptop.

Robert Dale <rob...@...>
 

For how long does the cpu remain high after you get the `gremlin>` prompt?

Robert Dale

On Thu, Aug 17, 2017 at 2:20 PM, 'Ray Scott' via JanusGraph users list <janusgra...@...> wrote:
Hi, 

I'm trying to get JanusGraph running on my laptop (MacBook Air 2 Core Intel i7, 8GB) so that I can develop a small working prototype. 

I've gone for the janusgraph.sh method of starting the server and everything is fine until I open a gremlin shell. Then I see the CPU usage for my terminal rocket up past 350%. Once I close the gremlin shell, the CPU usage remains at the same high level, indefinitely. I've tried launching with the Berkeley DB option and no Elastic Search, but I get the exact same behaviour. 

Is there something I can do to stop this from using most of my CPU, or do I just have to live with it? 

Is there a "lite" recommended setup that someone has had success with in the past? 

Thanks. 



Performance issues on a laptop.

Ray Scott <raya...@...>
 

Hi, 

I'm trying to get JanusGraph running on my laptop (MacBook Air 2 Core Intel i7, 8GB) so that I can develop a small working prototype. 

I've gone for the janusgraph.sh method of starting the server and everything is fine until I open a gremlin shell. Then I see the CPU usage for my terminal rocket up past 350%. Once I close the gremlin shell, the CPU usage remains at the same high level, indefinitely. I've tried launching with the Berkeley DB option and no Elastic Search, but I get the exact same behaviour. 

Is there something I can do to stop this from using most of my CPU, or do I just have to live with it? 

Is there a "lite" recommended setup that someone has had success with in the past? 

Thanks. 


Re: Graph Databases and Framework Meetup Group

Ioannis <idu...@...>
 

Fantastic, I was not aware that there is such an established community. Thank you for the detailed list. My "graph" queries :) at meetup.com did not return any of them (apart from the first one). Since I already paid to set up the meetup group, I will keep it for the next few months and then consider consolidating.

Warm regards,
Ioannis

On Wed, Aug 16, 2017 at 8:18 PM, Lynn Bender <ly...@...> wrote:
Dear All,

There already are several Graph Database meetups in the Bay Area:


Silicon Valley Graph Database Meetup

Bay Area Knowledge Graphs

San Francisco Graph Database Meetup

These groups are managed by the organization that sponsored the recent Graph Day conference in San Francisco.

Anyone who speaks at one of our groups will get a FREE ticket to the next Graph Day conference.
Any company that hosts one of our meetups will get a FREE ticket to the next Graph Day conference to distribute however they wish (except raffling it off publicly).

We are also organizing the upcoming Graph Day at Data Day Seattle.

We encourage you to submit a talk on Janus -- we still have a few slots left.

Kind regards,


On Wed, Aug 16, 2017 at 3:58 PM, Ioannis Papapanagiotou <idu...@...> wrote:
Dear all,

I created a generic meetup group for Graph databases and frameworks in the Bay Area. We (at Netflix) are thinking of hosting a meetup at some point in the near future so kicking the tires.

Thank you,
Ioannis Papapanagiotou




Re: JanusGraph and Cassandra modes.

Robert Dale <rob...@...>
 


I know it's a little confusing, but the page you point to is written from the perspective of Cassandra. The architecture page has some info on embedded vs. remote JanusGraph - http://docs.janusgraph.org/latest/arch-overview.html

Basically, where you open JanusGraph is where graph processing takes place. Between there and the backend storage is where lots of IO (e.g. network) will take place.
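A short sketch of the two styles (connection details are illustrative; the remote side follows the same withRemote pattern as elsewhere on this list):

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.util.empty.EmptyGraph;
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

// Embedded: this JVM does the graph processing; only storage IO goes to Cassandra.
JanusGraph graph = JanusGraphFactory.build()
        .set("storage.backend", "cassandra")
        .set("storage.hostname", "127.0.0.1")
        .open();

// Remote: Gremlin Server does the graph processing; this JVM only ships traversals.
GraphTraversalSource g = EmptyGraph.instance().traversal()
        .withRemote("conf/remote-graph.properties");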

Gremlin Server is a good option when 
-- as you mentioned, having non-JVM-based languages access the graph
-- if you want to separate resources of client processing, graph processing, and backend storage/indexing
-- if you want to separate your dependencies from the implementation - e.g. depending on TinkerPop, not JanusGraph, would allow you to swap out graph implementations without changing client code
-- if you want to separate concerns of maintenance of components
-- security architecture only allows client access over ports 80/443 and not directly to backend databases
-- security policies (authz/authn) are different for accessing the graph API vs. backend storage

Robert Dale


On Wednesday, August 16, 2017 at 5:41:38 AM UTC-4, Manoj Waikar wrote:
Hi,

The Cassandra-related JanusGraph documentation specifies various ways in which JanusGraph can be used in concert with Cassandra.

So, if I run Cassandra (on my machine using cassandra -f) and then from my Java / Scala code, if I do the following -
JanusGraph g = JanusGraphFactory.build().set("storage.backend", "cassandra").set("storage.hostname", "127.0.0.1").open();

Then -
  1. I am using the Local Server Mode.
  2. Whereas if Cassandra is running on another machine, and then if I replace 127.0.0.1 (localhost) with the IP of the server where Cassandra is running, then I am using the Remote Server Mode.
  3. Also, when Jason replied to my previous question 3, when he said "your application is creating an embedded graph instance" he didn't mean the JanusGraph Embedded Mode (because clearly, I am not running JanusGraph itself, so there is no question of it running in the same JVM instance as Cassandra)?
Is my understanding correct?

So, when is the Remote Server Mode with Gremlin Server useful? Is it useful when non-Java based applications would like to communicate with Gremlin server?

Also, if I have to host a web application (written in Java / Scala, on my own server) which stores data in Cassandra, then which mode is best? Is it the local / remote server mode depending on where Cassandra resides with respect to the web server?

Thanks in advance for the replies / help.


Re: Can we create a new API based on JanusGraph?

Robert Dale <rob...@...>
 


To which API are you referring?

There is the graph API - http://tinkerpop.apache.org/docs/current/reference/#_the_graph_structure
There is the traversal API (preferred) - http://tinkerpop.apache.org/docs/current/reference/#traversal

Then there are ways of accessing the API:
- Direct graph access (embedded) - graph = JanusGraphFactory.open('conf/janusgraph-cassandra-es.properties')
- Script transport over HTTP - http://tinkerpop.apache.org/docs/current/reference/#_connecting_via_rest
- Script transport over WebSocket - http://tinkerpop.apache.org/docs/current/reference/#connecting-via-java
- Traversal (bytecode) transport over WebSocket - http://tinkerpop.apache.org/docs/current/reference/#connecting-via-remotegraph
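For example, a hedged sketch of the WebSocket option using the TinkerPop Java driver against a default Gremlin Server on localhost:8182 (host, port, and the submitted script are illustrative):

import java.util.List;

import org.apache.tinkerpop.gremlin.driver.Client;
import org.apache.tinkerpop.gremlin.driver.Cluster;
import org.apache.tinkerpop.gremlin.driver.Result;

public class ApiExample {
    public static void main(String[] args) throws Exception {
        Cluster cluster = Cluster.build("localhost").port(8182).create();
        Client client = cluster.connect();
        // Same lookup as in the Gremlin Console, submitted as a script
        List<Result> results =
                client.submit("g.V().has('id_number', '3207221555223216').valueMap()").all().get();
        results.forEach(r -> System.out.println(r.getObject()));
        cluster.close();
    }
}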


On Wednesday, August 16, 2017 at 8:56:06 PM UTC-4, st...@... wrote:
Hello Rafael,
I've built the API using JanusGraph Server, and I have successfully accessed it.
In the Gremlin Console I can look up my vertex with
```
g.V().has('id_number', '3207221555223216')
```
I want to know how to get the same result with the API.
On Monday, August 14, 2017 at 11:06:52 PM UTC+8, Rafael Fernandes wrote:
no need my friend, just use JanusGraph Server...

rafael fernandes

On Sunday, August 13, 2017 at 11:27:57 PM UTC-4, hu junjie wrote:
I mean opening a new API to get customized data, or to post formatted data and load it into JanusGraph.


Re: How do I get out of a continuation line (?) in Gremlin

Amyth Arora <aroras....@...>
 

Another shortcut is :c
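For anyone finding this thread later, an illustrative console session — :c (the alias of :clear) abandons the buffered input:

gremlin> g.V().has(
......1> :c
gremlin>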


On Wednesday, 16 August 2017 05:35:28 UTC+5:30, Rohit Jain wrote:
You are so quick Robert!! :-)

I actually figured it out, since I found that :h worked in that situation and then I saw :c.  But you beat me to it before I could post that I had found the solution to the problem.

Thanks!!
Rohit 


Spark connector

Takao Magoori <ma...@...>
 

I have a JanusGraph Server (github master, gremlin 3.2.5) on top of a Cassandra storage backend, to store users, items, and "WHEN, WHERE, WHO bought WHAT?" relations.
To get data from and modify data in the graph, I use the Python aiogremlin driver mode (== Groovy sessionless eval mode) and it works well for now. Thanks, developers!

So now, I have to compute recommendations and forecast item sales.
For data cleaning, data normalization, recommendation, and forecasting on a somewhat large graph, I want to use higher-level PySpark tools (e.g. DataFrame, ML) and Python machine-learning packages (e.g. scikit-learn). But I cannot find a way to load graph data into Spark. What I want is a "connector" which can be used from PySpark to load data from JanusGraph, not SparkGraphComputer.

Could someone please explain how to do this?


- Additional info
It seems OrientDB has some Spark connectors (though I don't know whether they can be used from PySpark). But I want one for JanusGraph:
https://github.com/sbcd90/spark-orientdb
https://github.com/metreta/spark-orientdb-connector


Re: Graph Databases and Framework Meetup Group

Lynn Bender <ly...@...>
 

Dear All,

There already are several Graph Database meetups in the Bay Area:


Silicon Valley Graph Database Meetup

Bay Area Knowledge Graphs

San Francisco Graph Database Meetup

These groups are managed by the organization that sponsored the recent Graph Day conference in San Francisco.

Anyone who speaks at one of our groups will get a FREE ticket to the next Graph Day conference.
Any company that hosts one of our meetups will get a FREE ticket to the next Graph Day conference to distribute however they wish (except raffling it off publicly).

We are also organizing the upcoming Graph Day at Data Day Seattle.

We encourage you to submit a talk on Janus -- we still have a few slots left.

Kind regards,

On Wed, Aug 16, 2017 at 3:58 PM, Ioannis Papapanagiotou <idu...@...> wrote:
Dear all,

I created a generic meetup group for Graph databases and frameworks in the Bay Area. We (at Netflix) are thinking of hosting a meetup at some point in the near future so kicking the tires.

Thank you,
Ioannis Papapanagiotou

