Google Dataflow template job not scaling when writing records to Google Datastore

The job basically reads from a table in BigQuery, converts each resulting TableRow to a key/value pair, and writes that pair to Datastore.
The code is:

    PCollection<TableRow> bigqueryResult = p.apply("BigQueryRead",
            BigQueryIO.readTableRows().withTemplateCompatibility()
                    .fromQuery(options.getQuery()).usingStandardSql()
                    .withoutValidation());

    bigqueryResult.apply("WriteFromBigqueryToDatastore", ParDo.of(new DoFn<TableRow, String>() {
        @ProcessElement
        public void processElement(ProcessContext pc) {
            TableRow row = pc.element();

            // A new Datastore client is created for every element.
            Datastore datastore = DatastoreOptions.getDefaultInstance().getService();
            KeyFactory keyFactoryCounts = datastore.newKeyFactory().setNamespace("MyNamespace")
                    .setKind("MyKind");

            Key key = keyFactoryCounts.newKey("Key");
            Builder builder = Entity.newBuilder(key);
            builder.set("Key", StringValue.newBuilder("Value").setExcludeFromIndexes(true).build());

            Entity entity = builder.build();
            datastore.put(entity);
        }
    }));

This pipeline runs fine when the number of records I try to process is anywhere in the range of 1 to 100. However, when I put more load on the pipeline, i.e. ~10,000 records, it does not scale, even though autoscaling is set to THROUGHPUT_BASED and maxNumWorkers is set as high as 50 with an n1-standard-1 machine type. The job keeps processing 3 or 4 elements per second with one or two workers, which is hurting the performance of my system.
Any advice on how to scale up the throughput is very welcome. Thanks in advance.

Comments

  • Replacing the hand-rolled write DoFn with Beam's built-in Datastore sink solved it:

        userEntity.apply("WriteToDatastore", DatastoreIO.v1().write().withProjectId(options.getProject()));

    This solution was able to scale from 3 elements per second with 1 worker to ~1500 elements per second with 20 workers.
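    For context, a minimal sketch of how that one-liner fits into the original pipeline (the kind "MyKind", namespace "MyNamespace", and the "name" field used for the key are placeholders, and `bigqueryResult` is the PCollection<TableRow> from the question): a DoFn converts each TableRow into a Datastore v1 Entity proto, and the resulting PCollection<Entity> is handed to DatastoreIO.v1().write(), which batches mutations and parallelizes instead of opening a client and issuing one RPC per element.

    ```java
    import com.google.api.services.bigquery.model.TableRow;
    import com.google.datastore.v1.Entity;
    import com.google.datastore.v1.Key;
    import com.google.datastore.v1.client.DatastoreHelper;
    import org.apache.beam.sdk.io.gcp.datastore.DatastoreIO;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;

    // Convert each TableRow to a Datastore v1 Entity proto (no client needed here).
    PCollection<Entity> userEntity = bigqueryResult.apply("TableRowToEntity",
            ParDo.of(new DoFn<TableRow, Entity>() {
                @ProcessElement
                public void processElement(ProcessContext pc) {
                    TableRow row = pc.element();

                    // Build the key; "name" as the key field is an assumption.
                    Key.Builder key = DatastoreHelper.makeKey("MyKind", (String) row.get("name"));
                    key.getPartitionIdBuilder().setNamespaceId("MyNamespace");

                    Entity.Builder entity = Entity.newBuilder().setKey(key.build());
                    entity.putProperties("Key",
                            DatastoreHelper.makeValue("Value").setExcludeFromIndexes(true).build());
                    pc.output(entity.build());
                }
            }));

    // The built-in sink handles batching and scales with the workers.
    userEntity.apply("WriteToDatastore",
            DatastoreIO.v1().write().withProjectId(options.getProject()));
    ```

    Note that DatastoreIO.v1().write() consumes the proto-based com.google.datastore.v1.Entity, not the com.google.cloud.datastore.Entity used in the question's per-element client code.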
