Questions tagged [scala]

Scala is a high-level language that combines functional and object-oriented programming with high-performance runtimes. It is built for implementing scalable solutions, including big data and data science workloads that produce actionable insights, and is well suited to large-scale projects. Designed to express common programming patterns in a concise, elegant, and type-safe way, it fuses imperative and functional programming styles.

47 questions

3 answers

How to calculate the mean of a dataframe column and find the top 10%

I am very new to Scala and Spark, and am working on some self-made exercises using baseball statistics. I am using a case class to create an RDD and assign a schema to the data, and am then turning it into a DataFrame so I can use SparkSQL to select…

apache-spark scala

asked Jul 22 '15 at 14:16

the3rdNotch

4 answers

Data Science Tools Using Scala

I know that Spark is fully integrated with Scala. Its use case is specifically for large data sets. Which other tools have good Scala support? Is Scala best suited for larger data sets? Or is it also suited for smaller data sets?

scalability scala

asked Dec 10 '14 at 06:37

sheldonkreger

1,169
8
20

3 answers

Summarize and visualize a CSV in Java/Scala?

I would like to summarize (as in R) the contents of a CSV (possibly after loading it, or storing it somewhere, that's not a problem). The summary should contain the quartiles, mean, median, min and max of the data in a CSV file for each numeric…

tools visualization scala csv

asked Aug 28 '14 at 01:36

Trylks

1 answer

Saving Large Spark ML Pipeline to HDFS

I'm having trouble saving a large (relative to spark.rpc.message.maxSize) Spark ML pipeline to HDFS. Specifically, when I try to save the model to HDFS, it gives me an error related to Spark's maximum message size: scala> val mod =…

apache-spark apache-hadoop scala

asked Jan 08 '18 at 16:19

Thomas Cleberg

1,525
7
22

2 answers

What is the best deep learning library for Scala?

Does anyone have a recommendation for which libraries to use for deep learning?

deep-learning scala

asked Dec 22 '16 at 18:58

Soerendip

2 answers

Distributed k-means in Spark

I want to implement the K-means algorithm in Spark. I am looking for a starting point and I found Berkeley's naive implementation. However, is that distributed? I mean, I see no MapReduce operations. Or maybe, when submitted in Spark, the framework…

clustering k-means apache-spark distributed scala

asked Feb 10 '16 at 22:53

gsamaras

1 answer

Sampling with replacement, specify the probabilities

I am trying to do sampling with replacement in Scala/Spark, defining the probabilities for each class. This is how I would do it in R. # Vector to sample from x <- c("User1","User2","User3","User4","User5") # Occurenciens from which to obtain…

r apache-spark scala

asked Dec 18 '15 at 16:19

Stefano

1 answer

Plotting libraries for Scala on Zeppelin

My main question: it looks like Zeppelin limits the display of results to 1000. I know that I can change this number, but when I change it Zeppelin becomes slow. It also looks like the default plotting tool of Zeppelin plots the first 1000…

visualization scala

asked Nov 21 '15 at 23:02

Rami

1 answer

Has anyone succeeded in finding a good Scala/Spark kernel for Jupyter?

The ones I've tried so far: Almond: works very well for just Scala, but you have to import dependencies, and it gets tedious after a while. Unfortunately, it can't run when using Spark with YARN instead of local mode. Spylon-kernel: kernel connects, but…

apache-spark jupyter kernel scala

asked Aug 18 '20 at 12:38

Varun Gawande

2 answers

How does MLlib in Spark select variables in logistic regression?

I have a question about MLlib in Spark (with Scala). I'm trying to understand how LogisticRegressionWithLBFGS and LogisticRegressionWithSGD work. I usually use SAS or R to do logistic regressions, but I now have to do it on Spark to be able to analyze…

machine-learning bigdata logistic-regression scala apache-spark

asked May 04 '15 at 13:26

SparkUser

1 answer

How to set up multi-cluster Spark without Hadoop on Google Compute Engine

I'm new to Apache Spark. Is it possible to configure multi-cluster Spark without Hadoop? If so, can you please provide the steps? I would like to create clusters on Google Compute Engine (1 master, 1 worker).

bigdata apache-hadoop scala

asked Dec 07 '14 at 16:31

user4290511

3 answers

Scala vs Java if you're NOT going to use Spark?

I'm facing some indecision when choosing how to allocate my scarce learning time for the next few months between Scala and Java. I would like help objectively understanding the practical tradeoffs. The reason I am interested in Java is that I think…

java scala

asked Jun 07 '16 at 15:32

Hack-R

1,949
1
21
34

3 answers

How to install Polynote on Windows?

I've been searching around the Internet for a while but I have not been able to find detailed instructions on how to install Polynote (the polyglot notebook with first-class Scala support) for Windows with mixing multiple languages, Python and…

python ipython scala windows

asked Nov 05 '19 at 06:11

Pluviophile

4,203
14
32
56

2 answers

Hashing trick with random forest in Scala

I am trying to perform a hashing trick and then a random forest with Scala. I have the following code: val documents: RDD[Seq[String]] = sc.textFile("hdfs:///tmp/new_cromosoma12v2.csv").map(_.split(",").toSeq) val hashingTF = new HashingTF() val…

random-forest apache-spark scala

asked Sep 22 '16 at 08:34

keira

3 answers

Task not serializable Error

import org.apache.spark.SparkContext import org.apache.spark.SparkConf import org.apache.spark.sql.cassandra.CassandraSQLContext object Test { val sparkConf = new SparkConf(true).set("spark.cassandra.connection.host", ) val…

bigdata apache-spark scala

asked Sep 14 '16 at 12:56

Credosam
